Project Management Program Based on Topic Model

Nowadays, a project usually generates a huge number of documents, so this study designed a program based on a topic model to support project management. A topic model is a machine learning algorithm; it assumes that each document is a distribution over latent topics and each topic is a distribution over words. The program applies Latent Dirichlet Allocation (LDA), a popular topic model algorithm, to build the topic model, and applies “prototypical-text based interpretation” (PTBI) and the visualisation of PyLDAvis to identify the salient topics, the prototypical paragraphs and the minimum number of texts for topic interpretation. In this executive summary, I will show you how the program works step by step.

1. Prerequisites

Install the libraries below. Download the two CSS files from https://github.com/suhao3123/CSS, create a folder named assets in the root of your app directory, and put the two files in that folder; they are needed to launch the Dashboard created in the final section.

In [1]:
# pip install numpy                      # (install numpy)
# pip install pandas                     # (install pandas)
# pip install PyMuPDF                    # (install PyMuPDF for extracting info from PDF files)
# pip install tika                       # (install tika for extracting paragraphs from PDF files)
# pip install spacy==2.2.0               # (install spacy for lemmatization)
# conda install gensim                   # (install gensim for topic modelling)
# pip install pyLDAvis                   # (install pyLDAvis for topic visualisation)
# conda install -c conda-forge pyldavis  # (if you use Anaconda to install pyLDAvis)
# pip install plotly                     # (install plotly for visualisation)
In [2]:
import pandas as pd
import numpy as np
import re

# glob for extracting the directories of metadata
import glob

# PyMuPDF
import fitz

# tika
import tika               
from tika import parser   

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Visualisation
import plotly.express as px
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
import os

2. Importing documents, data wrangling and overview

Enter the directory of the PDF files you want to analyse; the chunks below will extract the text and metadata of the files.

In [3]:
# Collect the paths of the PDF files; the raw string keeps the backslashes in the Windows path literal
pdf_dir = r"D:\LEON\Business Analytics\Study\9. Business Project\Data set\Olympics"
pdf_files = glob.glob(os.path.join(pdf_dir, "*.pdf"))
pdf_files[:1]
Out[3]:
['D:\\LEON\\Business Analytics\\Study\\9. Business Project\\Data set\\Olympics\\Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf']
In [4]:
# Use PyMuPDF to extract all info of the PDF files (text, title, date, etc)
list_metadata = []
for i in pdf_files:
    with fitz.open(i) as doc:
        info = doc.metadata
        info['file_name'] = os.path.basename(i)
        text = ''
        for page in doc:
            text += page.get_text()   # older PyMuPDF releases name this method getText()
        info['Content'] = text
    list_metadata.append(info)
In [5]:
df = pd.DataFrame(list_metadata)
df['document_id'] = df.index
df = df.drop_duplicates(subset = ['Content'])             # drop duplicate rows
#df = df.dropna(subset=df.columns[[12]], how='any')        # drop rows whose text content is NaN                   
#df['Word_count'] = df ['Content'].str.count(' ') + 1
df.head(3)
Out[5]:
format title author subject keywords creator producer creationDate modDate trapped encryption file_name Content document_id
0 PDF 1.7 B Lewis Microsoft Word D:20210822083603+00'00' D:20210822083603+00'00' None Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf Examination of Witnesses (1-19) \n16 SEPTEMBER... 0
1 PDF 1.7 B Lewis Microsoft Word D:20210822083606+00'00' D:20210822083606+00'00' None Examination_of_Witnesses_Sept_2003_-_Q20-39.pdf Examination of Witnesses (20-39) \n16 SEPTEMBE... 1
2 PDF 1.7 B Lewis Microsoft Word D:20210822083609+00'00' D:20210822083609+00'00' None Examination_of_Witnesses_Sept_2003_-_Q40-44.pdf Examination of Witnesses (40-44) \n16 SEPTEMBE... 2
In [6]:
# check if there are documents with few words
#min_word_count= 10                                               # set the threshold of the minimum word count of each document 
#min_word_count_filter = df['Word_count'] <= min_word_count
#df_few_words = df[min_word_count_filter][['file_name', 'Content']]
#df_few_words
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 169 entries, 0 to 168
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   format        169 non-null    object
 1   title         169 non-null    object
 2   author        169 non-null    object
 3   subject       169 non-null    object
 4   keywords      169 non-null    object
 5   creator       169 non-null    object
 6   producer      169 non-null    object
 7   creationDate  169 non-null    object
 8   modDate       169 non-null    object
 9   trapped       169 non-null    object
 10  encryption    3 non-null      object
 11  file_name     169 non-null    object
 12  Content       169 non-null    object
 13  document_id   169 non-null    int64 
dtypes: int64(1), object(13)
memory usage: 19.8+ KB
In [8]:
# Word count
#df['Word_count'].sum( )

3. Natural language processing

3.1. Tokenisation

The texts extracted above will be split into individual words.
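
To make the tokenisation step concrete, here is a minimal standard-library sketch of roughly what a call such as simple_preprocess(..., deacc=True) does. The function name simple_tokenise, the a-z letter pattern and the length bounds are my own assumptions for illustration; gensim's real tokeniser handles Unicode letters more generally.

```python
import re
import unicodedata

def simple_tokenise(text, min_len=2, max_len=15):
    """Lowercase, strip accents, and keep alphabetic tokens of a bounded length."""
    # Decompose accented characters and drop the combining marks (e.g. "é" -> "e")
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Keep runs of letters only; punctuation and digits act as separators
    tokens = re.findall(r"[a-z]+", no_accents)
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = simple_tokenise("The Olympic Café opened on 16 September, 2003!")
print(tokens)
```

The digits and punctuation disappear because they never match the letter pattern, while the accent on "Café" is removed by the normalisation step.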

In [9]:
data = df.Content.values.tolist()
In [10]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence).encode('utf-8'), deacc=True))  # deacc=True strips accent marks; tokenisation itself drops punctuation

data_words= list(sent_to_words(data))

3.2. Processing words:

First, bigrams (phrases containing two words) and trigrams (phrases containing three words) are formed, and the words are lemmatised (different forms of a word are reduced to a single base form). Next, a threshold allows users to remove short words. Finally, the stop words are removed; users can add more stop words manually.
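
The role of the phrase threshold can be illustrated with a small standard-library sketch. The scoring formula below is a Mikolov-style collocation score written from my understanding of the default scorer in gensim's Phrases, so treat it as an assumption; bigram_scores and the toy documents are hypothetical.

```python
from collections import Counter

def bigram_scores(docs, min_count=1):
    """Score adjacent word pairs; a higher score means a stronger candidate phrase."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Discount rare pairs by min_count, then normalise by the unigram counts
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

toy_docs = [
    ["olympic", "park", "budget"],
    ["olympic", "park", "legacy"],
    ["olympic", "park", "transport", "budget"],
]
scores = bigram_scores(toy_docs)
threshold = 1.0   # a higher threshold yields fewer phrases
phrases = sorted(pair for pair, s in scores.items() if s > threshold)
print(phrases)
```

Only the pair that co-occurs repeatedly clears the threshold; pairs seen once score zero after the min_count discount.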

In [11]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
In [12]:
# import the stop_words from gensim
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS
stop_words = [i for i in STOPWORDS]

# add more stop words after analysing the key words of each topic from pyLDAvis in section 5.2. Topic visualisation 
new_stop_words = ['go', 'would', 'make', 'think', 'take', 'say', 'need', 'want', 'thing', 'have', 'lot', 'people', 'year','good','great','able','come','look','right',
                   'sure', 'day', 'moment', 'work','time', 'know', 'use', 'try', 'happen', 'ask', 'new', 'way', 'jonathan_stephen', 'david_higgin', 'dame_helen_ghosh','end']              
stop_words.extend(new_stop_words)
In [13]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stop_words(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
In [14]:
# Form Trigrams
data_words_trigrams = make_trigrams(data_words)

# Initialize the spaCy English model, disabling the parser and NER components (for efficiency)
# python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Increase the maximum text length (in characters) that spaCy will process
nlp.max_length = 13000000

# Do lemmatization keeping only noun, adj, verb
data_lemmatized1 = lemmatization(data_words_trigrams, allowed_postags=['NOUN', 'ADJ', 'VERB'])

# Set a threshold for removing the words with length less than the threshold
minimum_len = 3 
data_lemmatized2 = []
for i in data_lemmatized1:
    new_element = [x for x in i if len(x) >= minimum_len]
    data_lemmatized2.append(new_element)

# remove stop words
data_lemmatized = remove_stop_words(data_lemmatized2)

3.3. Dictionary and Corpus

The processed words are used to generate the Dictionary and the Corpus from which the topic model is built. The Dictionary assigns an ID (0, 1, 2, etc.) to each word; the Corpus lists, for each document, the (word ID, word frequency) pairs. Two parameters can be set to filter out very rare and very common words, as shown below.
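
As a rough illustration of these two structures, the following standard-library sketch builds a token-to-ID mapping with document-frequency filtering and converts documents to bag-of-words pairs. The helper names build_dictionary and doc_to_bow are hypothetical; this mimics the idea, not gensim's implementation.

```python
from collections import Counter

def build_dictionary(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer ID, filtering by document frequency."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    kept = sorted(t for t, df in doc_freq.items()
                  if df >= no_below and df / n_docs <= no_above)
    return {tok: idx for idx, tok in enumerate(kept)}

def doc_to_bow(doc, token2id):
    """Return sorted (token ID, count) pairs for the tokens that are in the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [
    ["venue", "budget", "budget", "games"],
    ["venue", "legacy", "games"],
    ["transport", "budget", "games"],
]
token2id = build_dictionary(toy_docs, no_below=2, no_above=0.9)
toy_corpus = [doc_to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Here "games" is dropped because it appears in every document, and "legacy" and "transport" are dropped because they appear in only one.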

In [15]:
# Create Dictionary, set the parameters to filter out tokens in the dictionary by their document frequency
no_below = 5             # keep only the tokens that appear in at least no_below documents (absolute number)
no_above = 0.85          # keep only the tokens that appear in no more than no_above of the documents (fraction of the total corpus size)
id2word = corpora.Dictionary(data_lemmatized)
id2word.filter_extremes(no_below = no_below, no_above = no_above)

# print the number of reserved unique tokens and word count after removal of high and low frequency words
print('After removal of high and low frequency words - Number of unique tokens: %d, %d' % (len(id2word),id2word.num_pos))
After removal of high and low frequency words - Number of unique tokens: 3396, 332393
In [16]:
# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

4. LDA Model

Now we can input the Dictionary and Corpus to build the LDA model, a basic and widely used topic model. We might need to tune the parameters and hyperparameters to get a higher coherence score, a measure of the interpretability of the extracted topics.

4.1. Building LDA Model, Parameter/Hyperparameter tuning

First we set the training parameters and hyperparameters.

In [17]:
# set training parameters and hyperparameters
k = 20                  # number of topics
passes = 20             # number of passes through the whole corpus during training
iterations = 100        # maximum number of inference iterations per document; if it is too low, some documents might not converge in time
alpha = 50.0/k          # document-topic prior; a higher α lets each document spread over more topics
eta = 0.01              # topic-word prior; a lower η concentrates each topic on fewer words
random_state = 12345    # random seed for reproducibility
minimum_probability = 0 # topics with a probability lower than this threshold will be filtered out

Now, we need to plot the coherence score against k to identify the optimal k, where the coherence score reaches its highest point. Because running this is quite time-consuming, I commented out some chunks below and simply set k to 10 based on the analysis of the result. Users who want to fit the model to another corpus can remove the hashes to reactivate the chunks and analyse the coherence scores against k.
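
Once the coherence scores are available, picking k can also be done programmatically rather than by eye. The sketch below is a minimal illustration; best_num_topics is a hypothetical helper and the scores are made-up numbers, not results from this study.

```python
def best_num_topics(coherence_by_k):
    """Return the k with the highest coherence; ties go to the smaller k."""
    return max(sorted(coherence_by_k), key=lambda num: coherence_by_k[num])

# Hypothetical scores for illustration only, not results from this study
example_scores = {5: 0.39, 10: 0.44, 15: 0.42, 20: 0.44}
print(best_num_topics(example_scores))
```

Preferring the smaller k on ties is a design choice: fewer topics are usually easier for a human reader to interpret.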

In [18]:
#start=1; limit=21; step=1 # generate a sequence of k values from "start" up to, but not including, "limit" in steps of "step"
#coherence_values = []
#model_list = []
#for i in range(start,limit,step):
    #model = gensim.models.LdaModel(corpus = corpus,id2word = id2word,alpha = alpha,eta = eta,iterations = iterations,num_topics = i,passes = passes,random_state = 12345,minimum_probability = minimum_probability)
    #model_list.append(model)
    #coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=id2word, coherence='c_v')
    #coherence_values.append(coherencemodel.get_coherence())
In [19]:
#list_num_topics = [i for i in range(start, limit, step)]
#df_coherence1 = pd.DataFrame({'Number_of_Topics': list_num_topics, 'Coherence_Score': coherence_values})
#df_coherence1.to_pickle('./df_coherence1.pkl') #save the result to disk
#df_coherence = pd.read_pickle('./df_coherence1.pkl') #load the result from disk
In [20]:
#fig1 = px.line(df_coherence, x = 'Number_of_Topics', y = "Coherence_Score", title = 'Coherence scores against number of topics')
#fig1.update_layout(autosize=False, width=1000, height=400)
#fig1.update_traces(mode = "lines + markers")
#fig1.show()
In [21]:
# set the number of topics that gives the highest coherence score
k = 10
lda_model = gensim.models.LdaModel(
    corpus = corpus,
    id2word = id2word,
    alpha = alpha,
    eta = eta,
    iterations = iterations,
    num_topics = k,
    passes = passes,
    random_state = 12345,
    minimum_probability = minimum_probability)
In [22]:
# print the coherence of the LDA model
coherencemodel2 = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_score = coherencemodel2.get_coherence()
coherence_score
Out[22]:
0.4364017884744861

4.2. Topic distribution of documents

Now we can get the topic distribution of documents.
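
The model returns each document's topic distribution as a list of (topic ID, probability) tuples with IDs starting at 0. The sketch below shows, on made-up numbers, how such tuples can be turned into rows keyed by 1-based topic IDs; to_one_based and the example values are hypothetical.

```python
def to_one_based(topic_probs):
    """Turn [(topic_id, prob), ...] with 0-based IDs into a {1-based ID: prob} dict."""
    return {topic_id + 1: prob for topic_id, prob in topic_probs}

# Hypothetical per-document output in the (topic ID, probability) shape
example_docs = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.1), (2, 0.8)],
]
rows = [to_one_based(doc) for doc in example_docs]
dominant = [max(row, key=row.get) for row in rows]   # most probable topic per document
print(rows)
print(dominant)
```

A list of such dictionaries converts directly into a dataframe with one column per topic.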

In [23]:
# create the function for converting a list of tuples into a dictionary
def Convert(tup, di):
    di = dict(tup)
    return di
In [24]:
# topic distribution of documents
list_topic = []
dictionary_topic = {}
for d in texts:
    bow = id2word.doc2bow(d)
    belong = lda_model[bow]                        # generate a list of tuples of topic distribution of a document
    belong_dic = Convert(belong, dictionary_topic) # convert the list of tuples into a dictionary
    list_topic.append(belong_dic)           
                      
df_topic_distribution = pd.DataFrame(list_topic)   # convert the list of dictionaries into a dataframe

# rename the topic IDs so that they match the topic IDs in pyLDAvis, which start from 1
original_topic_id = [*df_topic_distribution]; new_topic_id = [x + 1 for x in original_topic_id]
df_topic_distribution = df_topic_distribution.rename(columns = dict(zip(original_topic_id, new_topic_id)))
df_topic = pd.merge(df, df_topic_distribution, how = 'left', left_index=True, right_index=True) # merge with info of documents
df_topic.drop(['title','format','creator', 'producer', 'keywords', 'trapped', 'encryption','subject', 'modDate'], axis = 1)
Out[24]:
author creationDate file_name Content document_id 1 2 3 4 5 6 7 8 9 10
0 B Lewis D:20210822083603+00'00' Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf Examination of Witnesses (1-19) \n16 SEPTEMBER... 0 0.022669 0.012241 0.008059 0.027981 0.009974 0.008838 0.021381 0.346660 0.515521 0.026676
1 B Lewis D:20210822083606+00'00' Examination_of_Witnesses_Sept_2003_-_Q20-39.pdf Examination of Witnesses (20-39) \n16 SEPTEMBE... 1 0.054689 0.031731 0.008668 0.012626 0.032229 0.007006 0.012481 0.306131 0.488593 0.045848
2 B Lewis D:20210822083609+00'00' Examination_of_Witnesses_Sept_2003_-_Q40-44.pdf Examination of Witnesses (40-44) \n16 SEPTEMBE... 2 0.058003 0.024969 0.021409 0.042886 0.040421 0.022138 0.030234 0.301232 0.424574 0.034133
3 Bronwen Lewis D:20210822084116+00'00' Further_supplementary_memorandum_submitted_by_... Further supplementary memorandum submitted by ... 3 0.040255 0.023550 0.608594 0.029913 0.025182 0.093307 0.076155 0.033556 0.037462 0.032028
4 Bronwen Lewis D:20210822083921+00'00' Further_Supplementary_Memorandum_submitted_by_... Further supplementary memorandum submitted by ... 4 0.240306 0.092834 0.056631 0.042153 0.054051 0.064404 0.096172 0.192395 0.106136 0.054918
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
164 Bronwen Lewis D:20210822084528+00'00' Written_evidence_submitted_by_UK_Sport_-_Jan_2... Written evidence submitted by UK Sport \n \n ... 164 0.161374 0.746055 0.008360 0.015019 0.013721 0.006636 0.007021 0.007741 0.023814 0.010259
165 Bronwen Lewis D:20210822084531+00'00' Written_evidence_submitted_by_Vision_2020_UK_-... Written evidence submitted by Vision 2020 UK ... 165 0.010344 0.904470 0.010520 0.008543 0.008640 0.011762 0.010805 0.012390 0.012407 0.010119
166 Bronwen Lewis D:20210822084535+00'00' Written_evidence_submitted_by_VisitBritain_-_J... Written evidence submitted by VisitBritain \n... 166 0.008408 0.065365 0.009910 0.013444 0.185897 0.009476 0.018929 0.013965 0.014569 0.660036
167 Bronwen Lewis D:20210822084543+00'00' Written_evidence_submitted_by_Womens_Sport_and... Written evidence submitted by the Women's Spor... 167 0.039542 0.632806 0.014172 0.011664 0.121685 0.009411 0.039502 0.044871 0.022595 0.063752
168 Bronwen Lewis D:20210822084546+00'00' Written_evidence_submitted_by_Youth_Sport_Trus... Written evidence submitted by Youth Sport Trus... 168 0.009104 0.905684 0.007948 0.013781 0.015472 0.006963 0.010288 0.008576 0.010184 0.012001

169 rows × 15 columns

5. Topic interpretation tools

The tools aim to help users interpret the topics extracted above more efficiently and transparently. We first identify the salient topics as defined by PTBI, proposed by Marchetti and Puranam (2020), then combine the topic visualisation of PyLDAvis with the prototypical texts defined by PTBI to facilitate topic interpretation.

5.1. Salient topics for interpretation

Not all topics can be easily interpreted, because a topic model is likely to produce more topics than a human reader can easily interpret; therefore, PTBI selects only the salient topics for interpretation. For each topic, we compute the fraction of documents whose probability of belonging to the topic is greater than 1/K (Marchetti and Puranam, 2020, p. 14); I define this fraction as the “salience” of the topic.
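
The salience definition can be checked on a toy example with plain Python. The function topic_salience and the four example documents are hypothetical; the rule itself (probability greater than 1/K) is the one described above.

```python
def topic_salience(doc_topic_rows, num_topics):
    """Fraction of documents whose probability for each topic exceeds 1/num_topics."""
    cutoff = 1 / num_topics
    n_docs = len(doc_topic_rows)
    salience = {}
    for topic in range(1, num_topics + 1):
        above = sum(1 for row in doc_topic_rows if row.get(topic, 0.0) > cutoff)
        salience[topic] = above / n_docs
    return salience

# Hypothetical topic distributions for four documents and K = 4 topics
example_rows = [
    {1: 0.70, 2: 0.10, 3: 0.10, 4: 0.10},
    {1: 0.40, 2: 0.40, 3: 0.10, 4: 0.10},
    {1: 0.30, 2: 0.05, 3: 0.60, 4: 0.05},
    {1: 0.10, 2: 0.30, 3: 0.30, 4: 0.30},
]
salience = topic_salience(example_rows, num_topics=4)
ranked = sorted(salience.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

With K = 4 the cutoff is 0.25, so topic 1 is salient in three of the four documents and topic 4 in only one.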

The scree plot below shows that when the topics are sorted by salience in descending order, the salience drops to a low level and levels off at topic 6; as a result, we can select the topics ahead of topic 6 as the salient topics for interpretation.

In [25]:
# compute salience: for each topic, the fraction of documents whose probability of belonging to the topic is greater than 1/K
list_percent_above = []
for i in df_topic_distribution:
    num_above = df_topic_distribution[i][df_topic_distribution[i] > 1/k].count()
    percent_above =  num_above/len(df_topic_distribution)
    list_percent_above.append(percent_above)
    
df_salient_topic = pd.DataFrame({'topic_ID':  [str(i) for i in new_topic_id], 'salience': list_percent_above}).sort_values(
    by = 'salience', ascending = False)
In [26]:
fig_L1 = px.line(df_salient_topic, x = 'topic_ID', y = 'salience', title="Scree plot of salience of topics")
fig_L1.update_layout(autosize=False, width=800, height=400)
fig_L1.update_traces(mode = "lines + markers")
fig_L1.show()

5.2. Topic visualisation

I apply PyLDAvis to visualise the topics. The circles in the left panel represent the topics; their areas are proportional to the prevalence of the topics, and the distance between circles indicates the similarity between topics. The words in the right panel are sorted by their relevance to a topic, a measure for topic interpretation that weights both the estimated word frequency in the topic and that frequency relative to the overall word frequency. The λ in the top-right corner needs to be set to 0.6 to increase interpretability.
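
To show how λ trades off the two components, here is a small sketch of the relevance measure as defined by Sievert and Shirley for LDAvis: λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The function name and the probabilities are hypothetical illustrations.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical probabilities: a common word versus a topic-specific word
common = relevance(p_word_given_topic=0.05, p_word=0.05)
specific = relevance(p_word_given_topic=0.03, p_word=0.003)
print(common < specific)
```

With λ = 0.6 the topic-specific word outranks the common one even though it is less frequent in the topic; with λ = 1 the ranking would follow the in-topic frequency alone.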

Check the words of each topic. If there are common words with high overall frequency such as "think", "want" or "make", return to the "import the stop_words from gensim" chunk and add these words to the list of stop words to remove them.

In [27]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, sort_topics = False )
pyLDAvis.save_html(vis, './assets/lda.html') # save the result to disk
vis
Out[27]:

5.3. Prototypical paragraphs

The prototypical paragraphs, the paragraphs with the highest probability of belonging to a topic, can be used to assist topic interpretation. This section classifies the paragraphs into topics and provides users with four types of filter to select the prototypical paragraphs: the N most prototypical paragraphs overall, the N most prototypical paragraphs whose belong() value is greater than the threshold L, the N most prototypical paragraphs of each topic, and the N most prototypical paragraphs of a specific topic.
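
The four filters can be sketched with plain Python on a tiny made-up example. The helpers top_n and top_n_per_topic and the probabilities are hypothetical; this is one possible way to express the filters, not necessarily how the program implements them.

```python
def top_n(paragraphs, n, topic=None, min_prob=0.0):
    """Return up to n (index, topic, prob) triples, sorted by probability, descending.

    topic=None ranks every (paragraph, topic) pair; otherwise only that topic.
    min_prob plays the role of the threshold L.
    """
    triples = [
        (idx, t, p)
        for idx, probs in paragraphs.items()
        for t, p in probs.items()
        if (topic is None or t == topic) and p > min_prob
    ]
    return sorted(triples, key=lambda x: x[2], reverse=True)[:n]

def top_n_per_topic(paragraphs, n):
    """Return the n most prototypical paragraphs for every topic."""
    topics = sorted({t for probs in paragraphs.values() for t in probs})
    return {t: top_n(paragraphs, n, topic=t) for t in topics}

# Hypothetical paragraph index -> {topic ID: probability}
example_paras = {
    0: {1: 0.80, 2: 0.20},
    1: {1: 0.35, 2: 0.65},
    2: {1: 0.55, 2: 0.45},
}
print(top_n(example_paras, n=2))                 # filter 1: overall
print(top_n(example_paras, n=5, min_prob=0.6))   # filter 2: above threshold L
print(top_n_per_topic(example_paras, n=1))       # filter 3: each topic
print(top_n(example_paras, n=2, topic=2))        # filter 4: one topic
```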

5.3.1. Classifying the paragraphs based on the trained model

Extracting paragraphs from documents

The documents will be separated into paragraphs based on the delimiters representing blank lines. Users can compare the original files to the parsed texts of the files to identify the correct delimiters.
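
The effect of the delimiters is easiest to see on a short string. The sketch below applies the same delimiter pattern used in this program to a made-up sample; split_paragraphs and the sample text are hypothetical.

```python
import re

# The delimiter pattern used in this program
PARA_DELIMITERS = '[?.!-]\n|[?.!-] \n|  \n\n|\n\n[0-9]'

def split_paragraphs(text):
    """Split on the delimiters, flatten remaining line breaks, drop empty pieces."""
    pieces = re.split(PARA_DELIMITERS, text)
    pieces = [p.replace('\n', '').strip() for p in pieces]
    return [p for p in pieces if p]

sample = "Good morning and welcome.\nThank you very much. We are \nvery pleased to be here!\n"
example_paragraphs = split_paragraphs(sample)
print(example_paragraphs)
```

Note that the punctuation mark that is part of a delimiter is consumed by the split, so each paragraph loses its final full stop, question mark or exclamation mark, while a line break that follows an ordinary word does not start a new paragraph.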

In [28]:
# define the function for splitting documents into paragraphs by delimiters
def para_split(i):
    j = parser.from_file(i)
    m = j['content']
    return re.split('[?.!-]\n|[?.!-] \n|  \n\n|\n\n[0-9]', m) # users can modify the delimiters
In [29]:
list_paragraphs = []
list_para_id = []
for i in pdf_files:
    para = para_split(i)
    para = [w.replace('\n', '') for w in para]
    para = [x.strip() for x in para if x.strip()] # remove empty elements
    para_id = [x for x in range(len(para))] 
    list_paragraphs.append(para)
    list_para_id.append(para_id)
In [30]:
df_para1 = df.copy()
df_para1['paragraphs'] = list_paragraphs
df_para1['para_id'] = list_para_id
df_para2 = df_para1.apply(pd.Series.explode)
df_para3 = df_para2.reset_index()
df_para4 = df_para3[['creationDate', 'document_id', 'file_name', 'para_id', 'paragraphs']]
# print number of paragraphs extracted
len(df_para4)
Out[30]:
21640

The following chunks allow users to compare the original documents with the parsed texts to check whether the paragraphs are separated correctly; if not, they can modify the delimiters above and separate the paragraphs again. Remove the hashes to activate the chunks.

In [31]:
#df_para4['word_count'] = df_para4['paragraphs'].str.split().str.len()  
#df_para4.sort_values(by = 'word_count', ascending = False)         # sort the paragraphs by word count in descending order, check whether the word count is normal
In [32]:
#para_id = 7                         # input the index of the paragraph that might not be separated correctly based on its word count
#df_para4.loc[para_id,'paragraphs']  # print the paragraph
In [33]:
#df_para4.loc[para_id,'file_name'] # get the file name of the paragraph
In [34]:
#doc = parser.from_file(r'D:\\LEON\\Business Analytics\\Study\\9. Business Project\\Data set\\Olympics\\Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf') # input the file name
#doc_text = doc['content']
#doc_text                          # get the parsed text of the file
In [35]:
#print(doc_text) # get the original layout of the file, then you can compare it to the parsed text above to identify the correct delimiters

After the paragraphs are separated, users can set a threshold to filter out short paragraphs such as references.

In [36]:
# filter out the paragraphs with few words
n_word_count = 10                                                        # set the threshold of word count
para_word_count = df_para4['paragraphs'].str.split().str.len()           # word count of each paragraph
df_para = df_para4[(para_word_count>=n_word_count)].reset_index()        # select the paragraphs with word count not less than the threshold
df_para
Out[36]:
index creationDate document_id file_name para_id paragraphs
0 2 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 2 MS BARBARA CASSANI Q1 Chairman: Good morning, ...
1 3 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 3 Ms Cassani: Thank you very much. Thank you ver...
2 4 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 4 8 months I shall be able to meet frequently wi...
3 5 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 5 The first thing I should like to say is that I...
4 6 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 6 Really the backdrop is that I believe in the G...
... ... ... ... ... ... ...
17709 21631 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 37 7.3 When the impact of Olympics and Paralympi...
17710 21633 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 39 11 2007-08 School Sport Survey. 12 As ...
17711 21634 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 40 13 Gold Young Ambassadors work across School...
17712 21635 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 41 14 From national data supplied by Department...
17713 21637 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 43 5 The impact of School Sport Partnerships on...

17714 rows × 6 columns

Process the paragraphs

The paragraphs are processed in the same way as the documents.

In [37]:
# tokenization
data2 = df_para.paragraphs.values.tolist()
data_words2 = list(sent_to_words(data2))
In [38]:
# Form Trigrams
data_words_trigrams2 = make_trigrams(data_words2)

# Do lemmatization keeping only noun, adj, vb
data_lemmatized2 = lemmatization(data_words_trigrams2, allowed_postags=['NOUN', 'ADJ', 'VERB'])

# reuse the same minimum word length as before to remove the words shorter than the threshold
data_lemmatized2_2 = []
for i in data_lemmatized2:
    new_element = [x for x in i if len(x) >= minimum_len]
    data_lemmatized2_2.append(new_element)
    
# Remove Stop Words
data_lemmatized2_1 = remove_stop_words(data_lemmatized2_2)

Classify the paragraphs based on the extracted topics

Now we apply the trained LDA model to the paragraphs, which are classified by the probability that they belong to each topic. Users can drop the meaningless paragraphs after examining the prototypical paragraphs in the next section.

In [39]:
# belong function: infer the topic distribution of each paragraph; it might take a long time for a large corpus
list_topic_para = []
dictionary_topic_para = {}
for d in data_lemmatized2_1:
    bow = id2word.doc2bow(d)
    belong = lda_model[bow]
    doc_dic = Convert(belong, dictionary_topic_para)
    list_topic_para.append(doc_dic)
df_topic_para = pd.DataFrame(list_topic_para)      # build the dataframe once, after the loop
In [40]:
# rename the topic IDs so that they match the topic IDs in pyLDAvis
df_topic_para = df_topic_para.rename(columns = dict(zip(original_topic_id, new_topic_id)))

# topic distribution of paragraphs
df_topic_para1_1 = pd.merge(df_para, df_topic_para, how = 'left', left_index=True, right_index=True)
df_topic_para1_1

# save the result to disk
df_topic_para1_1.to_pickle('./df_topic_para_Olympics.pkl')
# load the result from disk
df_topic_para1 = pd.read_pickle('./df_topic_para_Olympics.pkl') 
In [41]:
# drop the paragraphs that appear with high frequency but are meaningless for interpretation, based on the extraction of prototypical paragraphs below
list_remove_para = [7622, 12966]                    # input the indices of the paragraphs you want to drop after examining the prototypical paragraphs in the next section
df_topic_para2 = df_topic_para1.copy().drop(list_remove_para) 
df_topic_para2.to_pickle('./df_topic_para_Olympics2.pkl') # save the result to disk

# print topic distribution of paragraphs
df_topic_para2                            
Out[41]:
index creationDate document_id file_name para_id paragraphs 1 2 3 4 5 6 7 8 9 10
0 2 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 2 MS BARBARA CASSANI Q1 Chairman: Good morning, ... 0.073004 0.055838 0.059690 0.053839 0.070040 0.055271 0.107584 0.187902 0.265389 0.071442
1 3 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 3 Ms Cassani: Thank you very much. Thank you ver... 0.088119 0.080434 0.079274 0.086185 0.086952 0.087036 0.094380 0.166091 0.143715 0.087814
2 4 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 4 8 months I shall be able to meet frequently wi... 0.091401 0.090485 0.108595 0.103616 0.084920 0.107042 0.094514 0.127285 0.099105 0.093036
3 5 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 5 The first thing I should like to say is that I... 0.051126 0.069764 0.030671 0.064049 0.045791 0.033653 0.045747 0.236413 0.367553 0.055234
4 6 D:20210822083603+00'00' 0 Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf 6 Really the backdrop is that I believe in the G... 0.103006 0.104074 0.060459 0.075865 0.061256 0.052142 0.069883 0.095555 0.273615 0.104144
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17709 21631 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 37 7.3 When the impact of Olympics and Paralympi... 0.080292 0.341630 0.067202 0.065383 0.097171 0.081064 0.063351 0.058232 0.056719 0.088957
17710 21633 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 39 11 2007-08 School Sport Survey. 12 As ... 0.108606 0.201046 0.084207 0.090331 0.086395 0.067951 0.077790 0.094008 0.087868 0.101798
17711 21634 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 40 13 Gold Young Ambassadors work across School... 0.170898 0.174138 0.078430 0.083826 0.083533 0.076720 0.081847 0.083701 0.078578 0.088329
17712 21635 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 41 14 From national data supplied by Department... 0.095260 0.140983 0.105393 0.088466 0.100969 0.098432 0.099462 0.095923 0.080132 0.094981
17713 21637 D:20210822084546+00'00' 168 Written_evidence_submitted_by_Youth_Sport_Trus... 43 5 The impact of School Sport Partnerships on... 0.090376 0.219336 0.079896 0.084346 0.095948 0.090754 0.085128 0.082485 0.078434 0.093297

17712 rows × 16 columns

5.3.2. N most prototypical paragraphs overall

Print the N paragraphs in the corpus with the highest probability of belonging to any topic.
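
The ranking can be illustrated on a toy table. The sketch below uses made-up probabilities and hypothetical names (`toy`, `ranked`, `top2`); it takes the row-wise maximum over the topic columns to find each paragraph's strongest topic, then keeps the N rows with the largest maximum.

```python
import pandas as pd

# toy topic distribution: 4 paragraphs, 3 topics (values are made up)
toy = pd.DataFrame(
    {"paragraphs": ["a", "b", "c", "d"],
     1: [0.2, 0.7, 0.1, 0.4],
     2: [0.5, 0.2, 0.1, 0.3],
     3: [0.3, 0.1, 0.8, 0.3]}
)
topic_cols = [1, 2, 3]

ranked = toy[["paragraphs"]].copy()
ranked["probability"] = toy[topic_cols].max(axis=1)   # highest probability per paragraph
ranked["topic"] = toy[topic_cols].idxmax(axis=1)      # topic that holds that maximum
top2 = ranked.nlargest(2, "probability")              # N = 2 most prototypical paragraphs
print(top2)
```

Selecting the topic columns by name, rather than by position, keeps the result correct even after extra columns are appended to the table.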

In [42]:
# N most prototypical paragraphs overall
df_topic_para2_n = df_topic_para2.copy()
topic_cols = df_topic_para2_n.columns[6:]                                          # topic columns, fixed before any new column is added
df_topic_para2_n['highest_p'] = df_topic_para2_n[topic_cols].max(axis = 1)         # get the highest probability among the topic distribution of each paragraph
df_topic_para2_n['salient_topic'] = df_topic_para2_n[topic_cols].idxmax(axis = 1)  # get the corresponding topic id
df_topic_para2_n = df_topic_para2_n[['index', 'file_name', 'salient_topic', 'paragraphs', 'highest_p']]
df_topic_para2_n.columns = ['Index', 'file', 'topic', 'paragraph', 'probability']
In [43]:
N1 = 5   # Set N to get the N most prototypical paragraphs overall
df_topic_para2_n.nlargest(N1,['probability']).style.set_properties(subset = ['paragraph'], **{'width':'1000px', 'length': '50px'})
Out[43]:
Index file topic paragraph probability
17552 21447 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 2 4. Most clubs seem ill-prepared for enquiries from, and inclusion of, people with disabilities who wish to participate in the sport offered by the club. There is little or no support for specialist clubs who provide opportunities for sport that cannot be integrated ie wheelchair basketball or blind cricket! There is often nothing locally to support the child or their parent in accessing the specialist provision and this often involves their having to undertake extensive travel to specialist facilities or organisations catering for this group. There is poor information regarding availability etc and http://www.parasport.org.uk was established as a portal to provide pathway and provision information. The then Mayor of London published a strategy in 2007, which highlighted all these issues and to date there has been little, if any, action to redress these anomalies in London or in the rest of the country. DCSF have, in my view, shown no leadership regarding the legacy of 2012 and its impact on PE in schools and the inclusion of those with disabilities in core curriculum activities or sporting opportunities within or after school. DCMS held a legacy event in 2008 and again in April 2009 focusing on the legacy of the Games for those with disabilities. One outcome was to seek greater links between DWP, DCSF and themselves to ensure that joint strategies were developed and pathways established that enabled children to enjoy and participate in PE and sport within schools/after school clubs/integrated and specialist provision in the community, with good national talent forums and pathways established for those who wish to participate in sport at a higher level and finally, with governing bodies having clear inclusive programmes for sports men and women with disabilities active at a national and international level 0.676578
17551 21446 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 2 3. After school activities often exclude this group of children ie no possibility of inclusion in team sports such as football, cricket, rugby, basket ball etc., limited access to swimming baths, athletic fields etc. All of these sports are undertaken by people with a disability, but not now normally at school. Specialist schools did provide a massive range of sporting opportunities and sport played a major part of my own adjustment to disability. I learned about teamwork, was able to set individual goals, have competition to extend my abilities, occasionally experienced "being a winner" and had that thrill of competition. One example of efforts to redress this issue within a school setting is in Leeds, which has a programme of monthly sporting and physical activities arranged in school time for vision impaired children within the schools. One solution easily achieved would be for groups of schools to come together monthly to provide sporting and recreational activities for disabled children and young people within their schools 0.627952
3329 3935 NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf 6 6 PREPARATIONS FOR THE LONDON 2012 OLyMPIC AND PARALyMPIC GAMES: PROGRESS REPORT JuNE 2008 10 The start and completion dates for the construction of the main venue and infrastructure projects delivered by the Olympic Delivery Authority at the end of March 2008 compared with the milestones in the November 2007 Programme Baseline ReportProjectEnabling Works (site preparation) Power Lines under Grounding (switchover only) Structures, Bridges and Highways utilities Main Stadium Aquatics Centre VeloparkHandball/Indoor Sports ArenaBasketballInternational Broadcast Centre/ Main Press CentreOlympic Village Eton Manor (training facilities and Paralympic events) Broxbourne (white water canoeing) Eton Dorney (rowing) Weymouth and Portland (sailing)construction start date November 2007 March 2008 Change in programme Forecast start date baseline (months)October 2006 October 2006 0 July 2008 July 2008 0 April 2008 April 2008 0 January 2008 January 2008 0 July 2008 May 2008 –21 September 2008 September 2008 0 March 2009 March 2009 0August 2009 June 2009 –2July 2009 November 2009 4May 2009 March 2009 –2 June 2008 May 2008 –1 March 2010 January 2010 –2 August 2008 May 2009 9 March 2009 January 2009 –2 May 2008 January 2008 –4construction end date November 2007 March 2008 Change in programme Forecast end date baseline (months)September 2009 September 2009 0 September 2008 November 2008 2 December 2011 December 2011 0 December 2011 August 2011 –4 Construction Construction end date end dateFebruary 2011 April 2011 2Completion date Completion date for construction for construction and initial overlay and initial overlay for test events for test eventsJune 2011 June 2011 0Construction Construction end date end dateApril 2011 August 2011 4Completion date Completion date for construction for construction and initial overlay and initial overlay for test events for test eventsJuly 2011 August 2011 1April 2011 
February 2011 –2April 2011 March 2011 –1April 2011 April 2011 0June 2011 July 2011 1 December 2011 December 2011 0 February 2012 April 2011 –10 June 2010 October 2010 4 April 2010 July 2009 –9 February 2009 January 2009 –1Source: National Audit Office examination of actual and forecast progress against the November 2007 Programme BaselineNOTE 0.623241
454 520 Jan_2003_-_Qs_200-220.pdf 9 The Committee suspended from 4.09 pm to 4.23 pm for a division in the House Alan Keen 202. I did not get to the end of the question at the beginning but the point I am making is that because we have to have a village and all the events have to be in that area it adds costs to hosting the Olympics, I reckon at least half a billion and probably a billion. If we could spread them round the country—and I went to Japan for the World Cup and the atmosphere was brilliant. We went to different places—more people could get to see it. If we could do that with the Olympics, the point I am really asking you is that it is difficult for the Government. The Minister and the Secretary of State are going to see the President of the IOC on Friday. It will not do our bid any good if they go there telling them how they should organise the Olympic Games in the future. I am really asking you as the main channellers of funding in sport in this country, will you make these representations that the Olympics, just for the sake of having 18,000 athletes in one village, which is very nice, although it is not so nice for those whose event comes on the last day and they want a party—we could save somewhere between half a billion and a billion pounds by using facilities we have got around the country now. The athletics could be at Wembley as they were supposed to be. The football could be at the main stadium and spread around the country as it is going to be in fact. What I am saying is that instead of having the athletes all together in one village for the three weeks of the Olympics, we could put a party on for them and they could stay for a week after the Olympics when they could all get drunk if that is what they do. I think somebody needs to go to the IOC and put this point to them. We have been taking evidence from people in the last couple of days and there are tremendous difficulties. 
There would hardly be a difficulty if we could use stadia around the country and we did not have to have the village. It is the village that causes all the problems that we are facing now 0.621515
3335 3942 NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf 6 Construction of the Main Stadium and its readiness for test events is critical to the Authority’s delivery programme. At the time of the November 2007 Programme Baseline Report, which was before contract signature, construction of the Main Stadium was scheduled to be completed in February 2011, to be followed by LOCOG overlay works for test events which were to be completed by June 2011. To date good progress has been made on site preparation, which has allowed construction to start in May 2008, two months earlier than planned.1 As a result of contract negotiations the Authority has agreed a longer construction period than anticipated (35 months rather than 32). The Authority has, however, built early access dates for LOCOG overlay works into the contract so overlay can be carried out in parallel with the construction work. The forecast date for readiness for test events remains at June 2011 0.619893

5.3.3. N most prototypical paragraphs where the belong() function is not less than the threshold L

For each topic, print the N paragraphs with the highest probability of belonging to that topic, keeping only those whose probability is not less than a threshold L.
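
One way to express this selection, sketched here on toy data and not the approach used in the program's own function, is to reshape the topic columns into long format, drop the rows below the threshold, and keep the N highest rows per topic. The column names, the threshold and the values are assumptions for illustration.

```python
import pandas as pd

# toy topic distribution: 4 paragraphs, 2 topics (values are made up)
toy = pd.DataFrame(
    {"paragraphs": ["a", "b", "c", "d"],
     "topic_1": [0.6, 0.7, 0.1, 0.4],
     "topic_2": [0.4, 0.3, 0.9, 0.6]}
)
L, N = 0.5, 2

# one row per (paragraph, topic) pair
long = toy.melt(id_vars="paragraphs", var_name="topic", value_name="probability")

selected = (
    long[long["probability"] >= L]                                   # threshold filter
    .sort_values(["topic", "probability"], ascending=[True, False])  # best first within each topic
    .groupby("topic")
    .head(N)                                                         # at most N rows per topic
    .reset_index(drop=True)
)
print(selected)
```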

I followed the method for extracting prototypical texts suggested by PTBI (Marchetti and Puranam, 2020, p. 14). PTBI aims not only to extract prototypical documents to improve interpretability, but also to find the minimum number of prototypical documents needed for topic interpretation. The algorithm is as follows:

  1. Define a threshold L (L ∈ [0,1]). For instance, we set L to be 0.5.
  2. For each topic, select the documents whose probability of belonging to the topic is not less than L (0.5).
  3. For each topic, check whether the number of documents selected is not less than 1/L. For instance, if L = 0.5, each topic needs at least 2 documents for topic interpretation. This check reduces the risk that a few documents have a high proportion of a topic merely because of randomness.
  4. Compute the percentage of interpretable topics, i.e. the topics that pass the check in step 3.
  5. Change L, keep iterating, and find the optimal L for which the percentage of interpretable topics is the highest.
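
The five steps can be sketched in plain Python as follows. This is an illustrative reading of the procedure with made-up probabilities, not the program's implementation; `interpretable_share` and `doc_topics` are hypothetical names.

```python
def interpretable_share(doc_topics, L):
    """Share of topics with at least 1/L documents whose probability is >= L."""
    num_topics = len(doc_topics[0])
    needed = 1 / L                                               # minimum number of documents per topic
    ok = 0
    for t in range(num_topics):
        selected = sum(1 for doc in doc_topics if doc[t] >= L)   # step 2
        if selected >= needed:                                   # step 3
            ok += 1
    return ok / num_topics                                       # step 4

# made-up topic distributions: 4 documents x 2 topics
doc_topics = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.5, 0.5],
    [0.2, 0.8],
]

# step 5: try several thresholds and keep the one with the highest share
candidates = [1.0, 0.5, 0.25]
shares = {L: interpretable_share(doc_topics, L) for L in candidates}
best_L = max(candidates, key=lambda L: shares[L])
print(shares, best_L)
```

Because `max` returns the first maximum it meets, listing the candidates from largest to smallest resolves ties in favour of the larger L, which requires fewer documents per topic.
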
Identification of the optimal L and minimum number of paragraphs of each topic for topic interpretation
In [44]:
List_num_doc = [x for x in range(1, 20, 1)] # generate a list of 1/L (minimum number of documents to interpret a topic)
list_L = [1/x for x in List_num_doc]        # generate a list of L
In [45]:
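# define the function for computing the percentage of potentially interpretable topics against parameter L
def perc(i, df):
    list_num_topics = []
    for j in df:
        topic_filter = df[j] >= i                  # paragraphs whose probability for topic j is not less than L
        m = df[j][topic_filter].count()            # number of such paragraphs
        list_num_topics.append(m)
    # count the topics with at least 1/L paragraphs, once all topics have been processed
    count1 = sum(map(lambda x : x >= 1/i, list_num_topics))
    perc1 = count1 / k
    return perc1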

The plot below shows that when L = 0.333, the percentage of interpretable topics is 100%, so I set L to be 0.333, i.e., each topic needs at least 3 (1/0.333) paragraphs whose probability of belonging to the topic is no less than 0.333 for interpretation. It is worth noting that L is inversely proportional to the minimum number of paragraphs needed to interpret each topic (1/L); in other words, the lower the threshold L, the more paragraphs users need to interpret the topics. Although the percentage of interpretable topics is also 100% when L = 0.1, the minimum number of paragraphs per topic then rises to 10 (1/0.1), which increases the workload of interpretation significantly.

In [46]:
# compute the percentage of interpretable topics for each candidate L, using the topic columns only
list_perc2 = []
for i in list_L:
    num = perc(i, df_topic_para2.drop(columns = ['index', 'creationDate', 'document_id', 'file_name', 'para_id', 'paragraphs']))
    list_perc2.append(num)

# plot the percentage of interpretable topics against the threshold L
df_L2 = pd.DataFrame({'Threshold_L': list_L, 'Percentage of interpretable topics': list_perc2})
fig_L2 = px.line(df_L2, x = 'Threshold_L', y="Percentage of interpretable topics", title = 'Percentage of interpretable topics')
fig_L2.update_layout(autosize=False, width=800, height=400)
fig_L2.update_traces(mode = "lines+markers")
fig_L2.show()
In [47]:
# define the function for extracting the N highest-ranked paragraphs of each topic
def top_n_filter(df, top_n):
    list_topic_id = [x+1 for x in range(0,k)]
    list_n_para = []
    list_n_p = []
    list_n_index = []
    list_n_file_name = []
    for x in range(1, k + 1): 
        top = df.nlargest(top_n, [x])                        # rank once per topic and reuse the result
        list_n_para.append(top['paragraphs'].tolist())
        list_n_p.append(top[x].tolist())
        list_n_index.append(top.index.tolist())              # take the indices from df, not from the global df_topic_para2
        list_n_file_name.append(top['file_name'].tolist())
    pd_n_para = pd.DataFrame({'Index':list_n_index, 'topic_id': list_topic_id, 'file': list_n_file_name, 'paragraph': list_n_para, 'probability': list_n_p})
    return pd_n_para.apply(pd.Series.explode).reset_index().drop('index', axis = 1)

Below are the 3 most prototypical paragraphs of each topic when the optimal L is set to 0.333.

In [48]:
L = 1/3 # set the optimal L based on the analysis above
df_top_n = top_n_filter(df_topic_para2, round(1/L))   # compute the table once; round() guards against floating-point error in 1/L
df_top_n[df_top_n['probability'] >= L].style.set_properties(subset = ['paragraph'], **{'width':'500px', 'length': '50px'})
Out[48]:
Index topic_id file paragraph probability
0 9932 1 Qs_202_-_238.pdf Lord Moynihan: Adrian, I am looking to the future and in this new found position of being apolitical and outside criticism of what may or may not have happened in the past I can tell you what is required moving forward is significant improvements rather than dwelling on what went wrong in the past. The first thing that is required is consistent funding. It is absolutely essential that we have consistent funding to the Olympic governing bodies. The second thing that is required is hiring the best coaches. I mentioned earlier that coaches are centres of excellence, and they really are. The coach inspires, the coach motivates, the coach can get the best out of our young Olympians of the future. We will deliver far better than we have in the past. The importance given to coaching needs to be on a step change level from where it has been in the past. I am very pleased to say that UK Sport and the Government are at one with us on the importance of reinforcing more emphasis on coaching. Then I mentioned in a highly competitive market, which it will be, to achieve fourth place we need to make sure not just the coaching but the management, the administration, the medical, the support services to our sports men and women are there and that we have the back-up facilities, the high performance centres properly resourced and that the governing bodies are given the support they need to make the decisions on behalf of their staff, their performance directors, their coaches. Ultimately we are all the servants of the sports men and women who will be up there winning medals in 2012. I am absolutely convinced that with the strength of the governing bodies and the talent we have in this country that is an achievable and deliverable target. I am absolutely convinced that unless Clearing the Bar is accepted we will not achieve that stretch target. 
That is why we are spending so many hours, the days and nights and little time we have available getting that document right and working with UK Sport to make sure we get that right. We have to do it now, we cannot afford to wait until the end of next year. We need to make sure that the funding is secure so that governing bodies can take the steps which are necessary next year. The time invested in the beginning of next year will be worth a huge amount when we get to 2012. It is time invested upfront to make sure that those plans can deliver the success. That means we will look at Beijing in a different light from how we would normally look at Beijing. We will look at Beijing as a stepping stone towards success in 2012 which may not best be judged by medal tallies, for example, but must be judged by how much progress the squads, the teams and the coaches have made en route from here to 2012; which will be our principal goal 0.615441
1 3639 1 NAO_Preparing_for_sporting_success_-_March_2008.pdf 4 UK Sport’s ‘ultimate goals’ for medal success at the London 2012 Games will require a step change in performance amongst elite athletes. The achievements of athletes at recent elite international events in a number of sports, including sailing, cycling, rowing, boxing, disability equestrian and disability shooting, suggest that performance levels in some sports are already improving significantly. Following increased spending on elite sport, host nations can typically expect to win an extra six or seven gold medals at an Olympic Games and to win medals across a wider range of sports. This ‘host nation effect’ would not in itself be enough to deliver UK Sport’s Olympic goal, which is likely to require an improvement of eight or nine gold medals over the Great Britain team’s performance at the Athens Games in 2004 if the relative performance of other nations remained the same. Changes in the performance of other nations since 2004, especially in the context of a general trend of increased spending on elite sport, sometimes referred to as a ‘global sporting arms race’, may also have implications for UK Sport in delivering its medal aspirations 0.556173
2 9912 1 Qs_202_-_238.pdf To reach the figure that we will need to conclude our work in the next four weeks and, we will continue to work very closely indeed with UK Sport. UK Sport has been very active in working with the British Olympic Association, going round to see the governing bodies, using a performance-based model which has been a very constructive and critical model in the context of working out the funding requirements. I think it is best summarised as work in progress. We are about three-quarters of the way through the discussions with the governing bodies; we have more to complete. Clearly within each governing body, Olympic governing body, there is an elite performance cell. That elite performance cell has a performance director associated with it and the performance director is critical to this process, as are the coaches. It is essential as far as the British Olympic Association is concerned that when Clearing the Bar is presented to the Olympic Board and subsequently presented to Government that it is agreed by all the summer Olympic governing bodies. There will be no point in coming to a figure that says collectively we can achieve fourth place in the Olympic medal table in 2012, if suddenly hockey or athletics, for the sake of argument, woke up the following morning and said "Hang on a second, we are not going to be able to contribute in the way you would like us to do on that budget". It would need to be robust, it would need to be capable of detailed analysis by this Committee and other committees in Parliament. It requires a significant amount of work which is underway at the moment. I emphasise that it is being undertaken in partnership with UK Sport, that is right and proper as they have significant expertise which has been very helpful to us in this process. We are on target to complete that work, Chairman, and we intend to make sure that it is presented on time. 
It is a budget, as I say, that must be robust and bought into by the Olympic governing bodies which ultimately will be responsible for performance on the day. The final point I would make in answer to your question, Mike, is that consistent funding is essential. We cannot have the situation whereby a governing body receives funding one year and maybe gets what they are expecting in year two but then loses out in funding in year three. If we are going to compete to come fourth, and we believe that is a realistic target to achieve, it is a tough stretch but it is realistic—it should be a tough stretch but it must be realistic—then we need consistent funding over the next six years. That is absolutely essential. If we are going to contract with the best coaches in the world, if we are going to provide the best sport facilities, that base line budget must be agreed and the governing bodies must be confident that there will not be a move away from that base line budget in recruiting the staff necessary to move from tenth to fourth in the medal table. That is the current position. We are working hard both within the BOA as well as with UK Sport and with outside experts to make sure that model is robust and that Clearing the Bar will achieve not only what it says but be widely accepted by your Committee and the sporting world in this country 0.540828
3 17552 2 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 4. Most clubs seem ill-prepared for enquiries from, and inclusion of, people with disabilities who wish to participate in the sport offered by the club. There is little or no support for specialist clubs who provide opportunities for sport that cannot be integrated ie wheelchair basketball or blind cricket! There is often nothing locally to support the child or their parent in accessing the specialist provision and this often involves their having to undertake extensive travel to specialist facilities or organisations catering for this group. There is poor information regarding availability etc and http://www.parasport.org.uk was established as a portal to provide pathway and provision information. The then Mayor of London published a strategy in 2007, which highlighted all these issues and to date there has been little, if any, action to redress these anomalies in London or in the rest of the country. DCSF have, in my view, shown no leadership regarding the legacy of 2012 and its impact on PE in schools and the inclusion of those with disabilities in core curriculum activities or sporting opportunities within or after school. DCMS held a legacy event in 2008 and again in April 2009 focusing on the legacy of the Games for those with disabilities. One outcome was to seek greater links between DWP, DCSF and themselves to ensure that joint strategies were developed and pathways established that enabled children to enjoy and participate in PE and sport within schools/after school clubs/integrated and specialist provision in the community, with good national talent forums and pathways established for those who wish to participate in sport at a higher level and finally, with governing bodies having clear inclusive programmes for sports men and women with disabilities active at a national and international level 0.676578
4 17551 2 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 3. After school activities often exclude this group of children ie no possibility of inclusion in team sports such as football, cricket, rugby, basket ball etc., limited access to swimming baths, athletic fields etc. All of these sports are undertaken by people with a disability, but not now normally at school. Specialist schools did provide a massive range of sporting opportunities and sport played a major part of my own adjustment to disability. I learned about teamwork, was able to set individual goals, have competition to extend my abilities, occasionally experienced "being a winner" and had that thrill of competition. One example of efforts to redress this issue within a school setting is in Leeds, which has a programme of monthly sporting and physical activities arranged in school time for vision impaired children within the schools. One solution easily achieved would be for groups of schools to come together monthly to provide sporting and recreational activities for disabled children and young people within their schools 0.627952
5 17553 2 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 5. Integration of disabled people into mainstream sporting provision is a concept rather than a reality. Grants to organisations has largely been based on physical access, rather than actual provision of activities, coaching etc. For some it has been a tick box action, rather than an attempt to include and integrate their provision. In order to create a lasting legacy from 2012, this offer of inclusion has to be both genuine and meaningful, ie accessible facilities with no manmade barriers re attitudes, lack of coaching or energy to find solutions regarding sporting opportunities. These are all major threats to establishing a broad base of participation pyramid with hundreds of disabled people participating in sport at the base level, filtering through at representative level for club and nationally, leading on to the tip of the pyramid—international competition at Paralympic level. This pathway does not currently exist for those with a disability and therefore, a different solution needs to be found that provides a separate provision (where appropriate) and uses the mainstream provision (where appropriate) 0.594271
6 2929 3 NAO_Preparations_for_the_Olympics_-_Progress_report_-_Feb_2010.pdf Figure 2The Delivery Authority’s November 2007 baseline for the Village and Media Centre and its December 2009 forecastall figures in £ million, inclusive of Vat november 2007 baseline Revised budget approved by funders in 2009latest forecast of potential costs december 2009Village and Stratford City land and infrastructure Land and infrastructure – public sector funding 522 553 560Estimated share of profits from post Games development (250) (250) (100)1Net Olympic Delivery Authority budget 272 303 460Village development Village construction and sales (gross) costs 0 1,095 1,1262Agreed sale of 1,379 units to Triathlon Homes 0 (268) (268)Estimated receipts from private sales – to fund construction 0 (177) (177)Village development – sub total 0 650 681Estimated receipts from private sales – to repay funders 0 (324) (324)Net Olympic Delivery Authority budget 0 326 357Media Centre3Media Centre 220 355 334Net Olympic Delivery Authority budget 220 355 334Total net cost to the Olympic Delivery Authority 492 984 1,151Source: National Audit Offi ce noteS1 See paragraph 4.10 0.602559
7 7679 3 PAC_-_Risk_assessment_and_management_-_July_2007_report.pdf 6. At the time of our hearing, a number of significant areas of uncertainty remained before the budget could be finalised.18 Tax – The Department explained that tax had been excluded from the cost estimates at the time of the bid on the basis that the tax treatment could not be resolved until the delivery structures for the Games were in place.19 Contingency provision – The cost estimates at the time of the bid had included a contingency provision in respect of individual projects, but the Department now considered that an overall programme contingency margin was required to reflect the interdependencies between projects and the risks associated with the knock-on effect that problems on one project could have on the rest of the programme.20 Security – It had not been possible at the time of the bid to provide a reliable estimate of the costs of policing and wider security, and the Department had notified Parliament of a contingent liability in this respect. The Olympic Security Co-ordinator was now working up plans and budgets in association with the Home Office and the Metropolitan Police.21 Private sector funding – At the time of the bid, £738 million of private funding had been expected to help meet costs associated with the Olympic Park. In the light of further work and advice, the Department now considered there was insufficient time to negotiate contracts with the private sector within the overall timescale for the Games, so there was now little prospect of securing significant private sector funding to deliver the Olympic Park. However, most of the funding for the Olympic Village was still expected to come from the private sector.22 0.596071
8 2652 3 NAO_Budget_for_2012_Olympics_-_July_2007.pdf 7 The Secretary of State announced a funding package of £9.325 billion to cover the costs and provisions set out in Figure 5, an increase of £5.906 billion on the public funding of £3.419 billion17 previously committed. (Figure 8) The contributions from the National Lottery and the Greater London Authority have increased, but the bulk of the additional funding is to come from the Exchequer, in line with the Government’s commitment to underwrite the cost of the Games. The costs to be covered by this increase in funding include £1.173 billion of tax18 which will ultimately flow back to the Exchequer. The Department has confirmed to us that the tax liabilities associated with the Games will be met entirely from Exchequer funding19, which means that ultimately the net increase in public sector funding is £4.733 billion. The funding increase of £5.906 billion includes contingency of £2.747 billion which the Department has made clear to us may not be used in full 0.549523
9 2002 4 Memorandum_submitted_by_LOCOG_-_Nov_2007.pdf Chief Medical Officer Creative Director Director of Strategy and Programme Management Head of Procurement Head of Client Services Head of Education Head of Venues Technology Head of Programme Solutions Head of Workforce Planning Head of Accommodation Head of Sport Policy Head of Culture Head of Telecommunications Head of Ceremonies Head of Administrative IT Head of Live Site Head of Venue Management Head of Ticketing — We now have in place the core components required to undertake the detailed planning for the Games. The team however remains small, at just under 200 people and our recruitment is planned carefully on the basis of the core work that needs to be undertaken now. 0.553650
10 17005 4 Written_evidence_submitted_by_Olympic_Park_Legacy_Company_-_Feb_2010.pdf 10. The Olympic Park Legacy Company's strategic objectives, as set by its Founders cover: — assisting the Government and the Mayor of London in fulfilling some of the legacy promises made in the bid to host the London 2012 Olympic and Paralympic Games; — securing the timely development of the Olympic Park site as a high quality and sustainable mixed community; — promoting social, economic and environmental benefits for local communities; — securing the long term development and management of the Olympic Park site and venues in ways which provide lasting national and local sporting, cultural, education and leisure benefits and which preserve the site's Olympic heritage; — working with partners to contribute to long-term economic growth and prosperity in the wider area; — levering in private investment to maximise and provide best value for the public purse; and — promoting sustainable development, community involvement and equality of opportunity 0.490745
11 2285 4 Memorandum_submitted_by_the_Olympic_Delivery_Authority_-_Oct_2006.pdf Maximising the value of the Olympic legacy Delivering a sustainable legacy is central to the Olympic Delivery Authority's work. As set out above, we have reviewed our master plans to ensure that the Olympic Park delivers a great legacy, as well as great Games in 2012. We now have plans that are better integrated with Stratford City and fit in better with local regeneration plans. For each of our venues, we are developing the legacy plans in significant detail, so that before we start building we have an affordable business plan for their after-use. For example: — We have confirmed proposals for a new tennis and hockey centre in the north of the site. Combined with the velopark legacy south of the A12, this will create a sporting anchor, complementing the upgraded provision of sporting facilities on Hackney Marshes 0.485637
12 16483 5 Written_evidence_submitted_by_Host_Boroughs_Unit_-_Feb_2010.pdf 4.1 The aim is that in the next 20 years, residents in the host boroughs will equal the London average in a range of the life indicators which you would expect to find in a successful community: — employment rates will increase to the London average; — average incomes in the bottom two fifths of earners in the host borough area will be increased to the London average; — young people in the host borough area will have improved GCSE results to at least the London average; — host borough 11 year olds will have at least the same educational attainment as the London average; — the number of families in receipt of benefits in the host boroughs area will fall to no more than the London average; — the rate of violent crime will continue to fall and reflect the London average; and — residents in the host boroughs area, particularly men, will have increased life expectancy to the London average 0.566613
13 4947 5 Oral_evidence_-_17_March_2010_Qs_100-143_-_Boroughs.pdf Sir Robin Wales: It is worth making a comment here. It has been very interesting watching this because Jules has led very much with a vision for this place, which the five boroughs have supported but Jules has driven that vision. It is as we have gone on and people have begun to connect with the SRF and understand what we are trying to do that people have begun to realise we are now with the OPLC, which I think is people understanding that legacy is important and beginning to line up behind the vision that Jules has pushed extensively because he understands the nature of his community and how that might work and how it will relate. I think it is a really good example of something being pushed by a local borough, backed by the rest of us, looking to have a vision that will make a difference there and will link in with the community he has got. It comes back to heroic economic assumptions. What comes out at the end will come out, but at least we are trying to do something that will deliver, something that will work for the local area, based on the vision that we have had locally and people are now beginning to line up to. So the question now is: do we get people lined up to support us on public policy and then the jobs that will come out will be the jobs that come out and they will begin to make a difference, particularly to Hackney but also to some of the people in Newham who will be able to access that, and other boroughs. It is a really good example of how the vision has been led by boroughs and people are now getting it 0.536632
14 4925 5 Oral_evidence_-_17_March_2010_Qs_100-143_-_Boroughs.pdf We have then a development site sitting there with the Park and the opportunities, and it is the best development site in Europe, certainly, and possibly further afield, and you have opportunities for jobs there all down the Lee along the Thames and the Royal Docks. The Mayor of London has said that for the next 20 years that is the hub of the regeneration of London. There are lots and lots of jobs coming there and so we need to make sure that those jobs result in people who have never worked getting into those jobs; local people getting access to those jobs and then trying to keep them locally. Very interesting, in Newham we did some research and people moving into Newham are poorer than people moving out of Newham. People who are unemployed, people who are not working, move in, get jobs and move out and we need to find a way of getting some stability around those communities, but there are tens of thousands of jobs coming. Many of them will be unskilled but I have to say that there is also an argument for having high skilled jobs so that there is an aspiration; so it is about that balance. Jules has fought very hard to get some high skilled jobs in Hackney, and those jobs are coming. Can we get people ready to access those jobs? Can we get people who will go in and take those jobs rather than sucking people in from elsewhere, that is the challenge 0.530890
15 3329 6 NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf 6 PREPARATIONS FOR THE LONDON 2012 OLyMPIC AND PARALyMPIC GAMES: PROGRESS REPORT JuNE 2008 10 The start and completion dates for the construction of the main venue and infrastructure projects delivered by the Olympic Delivery Authority at the end of March 2008 compared with the milestones in the November 2007 Programme Baseline ReportProjectEnabling Works (site preparation) Power Lines under Grounding (switchover only) Structures, Bridges and Highways utilities Main Stadium Aquatics Centre VeloparkHandball/Indoor Sports ArenaBasketballInternational Broadcast Centre/ Main Press CentreOlympic Village Eton Manor (training facilities and Paralympic events) Broxbourne (white water canoeing) Eton Dorney (rowing) Weymouth and Portland (sailing)construction start date November 2007 March 2008 Change in programme Forecast start date baseline (months)October 2006 October 2006 0 July 2008 July 2008 0 April 2008 April 2008 0 January 2008 January 2008 0 July 2008 May 2008 –21 September 2008 September 2008 0 March 2009 March 2009 0August 2009 June 2009 –2July 2009 November 2009 4May 2009 March 2009 –2 June 2008 May 2008 –1 March 2010 January 2010 –2 August 2008 May 2009 9 March 2009 January 2009 –2 May 2008 January 2008 –4construction end date November 2007 March 2008 Change in programme Forecast end date baseline (months)September 2009 September 2009 0 September 2008 November 2008 2 December 2011 December 2011 0 December 2011 August 2011 –4 Construction Construction end date end dateFebruary 2011 April 2011 2Completion date Completion date for construction for construction and initial overlay and initial overlay for test events for test eventsJune 2011 June 2011 0Construction Construction end date end dateApril 2011 August 2011 4Completion date Completion date for construction for construction and initial overlay and initial overlay for test events for test eventsJuly 2011 August 2011 1April 2011 
February 2011 –2April 2011 March 2011 –1April 2011 April 2011 0June 2011 July 2011 1 December 2011 December 2011 0 February 2012 April 2011 –10 June 2010 October 2010 4 April 2010 July 2009 –9 February 2009 January 2009 –1Source: National Audit Office examination of actual and forecast progress against the November 2007 Programme BaselineNOTE 0.623241
16 3335 6 NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf Construction of the Main Stadium and its readiness for test events is critical to the Authority’s delivery programme. At the time of the November 2007 Programme Baseline Report, which was before contract signature, construction of the Main Stadium was scheduled to be completed in February 2011, to be followed by LOCOG overlay works for test events which were to be completed by June 2011. To date good progress has been made on site preparation, which has allowed construction to start in May 2008, two months earlier than planned.1 As a result of contract negotiations the Authority has agreed a longer construction period than anticipated (35 months rather than 32). The Authority has, however, built early access dates for LOCOG overlay works into the contract so overlay can be carried out in parallel with the construction work. The forecast date for readiness for test events remains at June 2011 0.619893
17 3336 6 NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf Construction of the Aquatics Centre and its readiness for test events is critical to the Authority’s delivery programme. At the time of the November 2007 Programme Baseline Report, which was before contract signature, construction of the Aquatics Centre was scheduled to be completed in April 2011, to be followed by LOCOG overlay works for test events which were to be completed by July 2011. As a result of contract negotiations, the Authority has agreed a longer construction period. The Authority considers, however, at this stage the impact on the date of readiness for test events is minimal. The Authority has built early access dates for LOCOG overlay works into the contract so that overlay can be carried out in parallel with the construction work. The forecast date for readiness for test events of August 2011 is one month later than originally expected 0.586903
18 8304 7 PAC_26_April_2012_-_Olympic_Costs_-_corrected_evidence_(no_report).pdf You think that the assessment of risks is our best estimate of the most likely outcome of the budget as a whole. But actually the assessment of risk-and how we have compiled it-is this: we have not sought to estimate how likely it is that every risk arises. We just said, "Let us think about every risk that could arise, and let us assume that they all arise and work out the likely cost of them all arising." On top of that, we said, "And there will be some risks that we just cannot think about that are unknown unknowns. There will be some multiple consequentials if everything came together." So we end up with an estimate not of the most likely cost of the project, which is what the burden of paragraph 1 of the PAC Report understands it is, but an estimate of how much we would need to set aside in the very unlikely event that all risks arise and some more unknown risks arise as well. The purpose of that is not to get to an estimate of the likely outcome of the budget. Its purpose and why we do it is to see, against any reasonable view of the likely risk that might arise, even on an assumption that they all arise and some more unknown risks arise, whether we have enough money. The conclusion has always been, yes, we had. Against what is therefore, in my view, a conservative and prudent estimate, we had £36 million headroom at the time of the NAO Report. We had more, and indeed the picture over the six-month period since the original figures on which the NAO was recording this, is that the contingency has gone down by £27 million or so-we reckon, because these are provisional figures, but I want to give our best figures-and the assessed risks on that very conservative and prudent basis have gone down by £136 million. So the picture on the budget as a whole is that we are spending contingency significantly slower than risks are disappearing from the programme. 
That is why, without in any sense being complacent, I am confident that we will bring this in within budget, and I do not think that the budget is close to being used up 0.605655
19 8388 7 PAC_26_April_2012_-_Olympic_Costs_-_corrected_evidence_(no_report).pdf Jonathan Stephens: There are two aspects to risk: likelihood and impact. What we are saying is that we made no estimate of likelihood, we just wrote in a 100% likelihood of all the risks we could think of and some unknown risks that we could not think of. We then looked at impact, and on impact we said, "If this risk were to materialise-we are assuming a 100% likelihood that it materialises-what is the likely cost?" That is where you get the low, the most likely outcome and the high outcome. When you add those together, you do not get to an outcome of, "What is the most likely expenditure on the programme?"; you get to an outcome of, "If all conceivable risks arise, plus some unknown risks that we cannot identify, what is the likely expenditure?" That is a conservative and prudent view of, "Do we have enough contingency left, if all those risks arise?" In practice, they won’t all arise. It is conceivable that some will arise, but it is pretty unlikely that all of them will arise. It is perfectly conceivable that some individual risk will arise at a higher estimate than the most likely estimate, but the prospect of all those risks arising is unlikely. The prospect of them all arising at the very highest possible cost is so unlikely as to not provide a good basis for planning. I am sorry. I am going on at some length, but there is a real point 0.570251
20 8439 7 PAC_26_April_2012_-_Olympic_Costs_-_corrected_evidence_(no_report).pdf Q66 Ian Swales: Just to build on what the CAG is saying, although we do not need to go over all this detail again, the thing that I think comes out of your letter of 25 January-which of course we did not have when we had the hearing-was in particular the G4S contract going from £86 million to £284 million. You observe about labour costs that, of that increase of £198 million, £83 million is labour. The Committee had all sorts of questions about whether that was a sensible figure, given the number of volunteers and military and so on involved, but let us not go over that again. Anyone who has negotiated a contract like this would comment on an increase in programme management costs from £7 million to £60 million, and in operational expenditure-uniforms, etc.-from £3 million to £65 million. It beggars belief that G4S thought that from a total of £10 million for those items, excluding labour, it could now convince you that costs have moved to £125 million. Your comments in Civil Service World beggar everyone’s belief, that if we had known how many security guards were needed, instead of £10 million we would have agreed £125 million for the two figures. It does not feel that that can be credible 0.483295
21 4654 8 Oral_evidence_-_15_Dec_2009_LOCOG_ODA.pdf Mr Deighton: I think the first thing is in the next one and a half years or so anybody presenting themselves as selling new tickets we want to make quite clear to the general public that cannot be possible, because tickets only go on sale in 2011. Any attempt by anybody—and this is something I think we would like the Committee's help on—if you are hearing about this in your constituencies where they think they are being offered tickets, let us know; we are trying to be as public as possible saying, "Anything that happens pre-2011 before our formal ticket launch can't be possible". Anyone trying to do that is effectively trying to take advantage of the public. I think that is the first thing. Secondly, of course, with Olympic tickets, like Premier League tickets, the law is very, very clear; they cannot be resold at a profit. We will be working with the police to follow up on any instances where we think that is happening; whether it is tracking it down on the internet, or whether it is happening in practice. I think the other thing that is really important, as I suggested earlier, we are very keen in our business plan for ticketing to make sure that we get those tickets initially into the hands of the people who most want them, so the opportunity to create this kind of secondary market is as limited as possible. That is why we are really building a very carefully constructed book for each sport, so once those guys have got that ticket it is not going to come back onto the market. If they do want to resell the ticket, we have a plan to develop a London 2012 ticket exchange so they will be able to resell it through us, so they will have no excuse. For whatever reason they cannot go, we will be able to resell it for them; not at a profit but we will be able to take care of their original purchase price. 
That again is a small facility which should constrain the supply of potential tickets that could represent a touting risk if not controlled 0.593694
22 4644 8 Oral_evidence_-_15_Dec_2009_LOCOG_ODA.pdf Indeed, if you are trying to sell 9.2 million tickets to such a range of sports, a number of which are not that well known in the UK, the only way we are going to get full stadia, full of enthusiastic fans, is to make them highly affordable, and that is our objective. It is our objective; it is certainly the objective of the Olympics Minister; it is certainly the objective of the Mayor of London; and those stakeholders are actually defining our final pricing strategy over the months ahead. The only reason it is difficult at this point to talk in such specific terms, as we have with our bid promise, is that of course back then the Games actually had a different portfolio of sports: we had baseball and softball, where you would have had 700,000 tickets at the very low end of the pricing. We just want to make sure we have got a very precise grip of the supply side of the tickets. What we have been working on is our competition schedule, so we know exactly what sport is going to be when; what is in the morning; what is in the afternoon; how many sessions we have. Just last week the IOC confirmed for example the format of the cycling competitions—where they redistributed some events towards the sprints and equalising events between men and women. You have to know what events you are putting on before you know what tickets you have got to sell. We are also looking very specifically at the seating bowl, so again I know how many seats I have to sell. We are working with the broadcasters, for example, on camera positions. If you have lots of camera positions you have fewer seats left to sell. All that work, on the schedule, the seats, is what you need to know for how many you have got; and then on the demand side, we are building up a sport-by-sport plan for who the fans are, and who is going to come and watch these sports. 
It is a very different proposition to get somebody to come to the final of the 100 metres in the main stadium, compared to someone to come, as I mentioned earlier, to the preliminary rounds of the handball competition where handball has not been regularly played in the United Kingdom. We need a sport-by-sport analysis of where that demand is 0.540326
23 4632 8 Oral_evidence_-_15_Dec_2009_LOCOG_ODA.pdf Mr Deighton: This is a very interesting question. The essence of every sponsorship agreement is that each company seeks exclusivity within its own category. Our role is to protect their rights around that exclusivity in that category. The complexity comes from when you introduce potential retailers, because of course we have a number of existing sponsors who sell products and they sell products through different retailers. If we create a retailing category of course those retailers also sell the products of competitors to our existing sponsors. What we have to establish is a transaction which protects our existing sponsors yet would also allow the retailer to be able to activate its rights with the Games. We continue to have discussions to try and make that a possible compromise. Our primary objective is to make sure that the rights of our existing sponsors are properly protected, because of course that is what they have paid for. 0.519459
24 454 9 Jan_2003_-_Qs_200-220.pdf The Committee suspended from 4.09 pm to 4.23 pm for a division in the House Alan Keen 202. I did not get to the end of the question at the beginning but the point I am making is that because we have to have a village and all the events have to be in that area it adds costs to hosting the Olympics, I reckon at least half a billion and probably a billion. If we could spread them round the country—and I went to Japan for the World Cup and the atmosphere was brilliant. We went to different places—more people could get to see it. If we could do that with the Olympics, the point I am really asking you is that it is difficult for the Government. The Minister and the Secretary of State are going to see the President of the IOC on Friday. It will not do our bid any good if they go there telling them how they should organise the Olympic Games in the future. I am really asking you as the main channellers of funding in sport in this country, will you make these representations that the Olympics, just for the sake of having 18,000 athletes in one village, which is very nice, although it is not so nice for those whose event comes on the last day and they want a party—we could save somewhere between half a billion and a billion pounds by using facilities we have got around the country now. The athletics could be at Wembley as they were supposed to be. The football could be at the main stadium and spread around the country as it is going to be in fact. What I am saying is that instead of having the athletes all together in one village for the three weeks of the Olympics, we could put a party on for them and they could stay for a week after the Olympics when they could all get drunk if that is what they do. I think somebody needs to go to the IOC and put this point to them. We have been taking evidence from people in the last couple of days and there are tremendous difficulties. 
There would hardly be a difficulty if we could use stadia around the country and we did not have to have the village. It is the village that causes all the problems that we are facing now 0.621515
25 466 9 Jan_2003_-_Qs_200-220.pdf 206. But this has only just come up, has it not? In all our enquiries into Wembley Stadium lasting over years the problem of the location of the village in relation to the stadium has never ever come up. Now we are told that we need a new stadium in East London that is likely to be surplus to requirements as soon as the Olympics are over, should we get them, because of its juxtaposition to a village whose location has not been decided upon anyhow. I realise that distances in Manchester are not the same as distances in London but the village in Manchester was some considerable distance away from the stadium in Manchester and because, among other things, the structure created by Mr McCartney included an excellent transport system, there was no problem in getting there. I lament that we are having to put these questions to you but this was not a relevant issue yesterday when we had the IOC and the BOA in front of us. Why at this stage has this whole issue of the juxtaposition of the village to the stadium become critical to the likelihood of a successful bid and why have we only just heard about it now? Why, after all of these years, ever since 1996 when Sport England handed over £120 million for Wembley, is it, seven years later, that this has suddenly surfaced as a problem 0.586045
26 501 9 Jan_2003_-_Qs_200-220.pdf introduce the ability to stage athletics in the new Wembley and that is the situation that exists today. Where I think there is a danger of us mixing apples with pears is there is no doubt that Wembley as it was originally conceived, as Michael Cunnah has confirmed in his letter to you today, has the facilities to provide the athletics venue for an Olympic bid if that were the wish of those who are responsible for mounting the bid to the IOC. About two years ago I think the Mayor of London became involved in discussions with the BOA and at that stage, as Ian has quite rightly confirmed to you, an analysis of a West London bid with Wembley as a centrepiece was examined and we began to explore for the first time—I did not but it began to be explored—the East London option which looked at the present site and all the regeneration benefits that that would bring to the area and the legacy that would result, and it was at that time that a quite deliberate and conscious decision was taken that if we had to bid for the Olympic Games then the bid would be based upon a location in East London and not in West London. I can confirm that I am aware that considerations were given to village accommodation in the West London area. I do not think it ever got to an advanced stage because by that time people's attention had turned to East London. So I think the honest answer is the West London options were looked at but never became fully developed because whilst that was being examined the attractions of East London became known and people's attention turned in that direction. 0.577957
27 10389 10 Report_and_Minutes_-_Jan_2007.pdf 19. There was much more optimism about scope for increasing tourist traffic after the Games. Mr Castle, the East of England representative on the Nations and Regions Group, described the Games as a “shop window” for the UK. Both he and the Tourism Management Institute saw scope for the Games to generate business tourism.254 The DCMS memorandum stated that “experience from recent host cities indicates that tourism will increase significantly across the UK, most notably after the Games”;255 and the Tourism Alliance told us that DCMS expected that up to 80% of the legacy benefit to be derived from hosting the Games would be gained through “increased tourism as a result of [the] high degree of international media exposure”.256 The Tourism Alliance itself agreed that the main way that lasting benefits would be reaped would be through media exposure; but it saw Government investment in a tourism strategy as being a necessary part of drawing on that exposure; and it spoke of a “lack of realisation within DCMS that additional funds need to be committed … to marketing and media support”. The Government has pledged that the interests of tourism “will be taken into account in all Olympic policy decisions”; underlying this pledge, however, was a statement by the Secretary of State that, in order to increase the number of visitors as a result of the Games, the tourism industry needed “to improve the consistency of its quality, raise the level of skill and, through imaginative marketing, showcase Britain’s heritage and its dynamic, 21st century cities”.257 0.475559
28 11848 10 Report_and_Minutes_-_Jan_2007_-_vol_2_-_evidence.pdf 2.2 We welcome the opportunity to support the Games and are already investing in some of the keyinstitutions which will deliver a high quality Games. Institutions such as Birmingham Museum and ArtGallery, Bristol Museum and Art Gallery, Tyne and Wear Museums and the Museum of London have allreceived funding through Renaissance in the Regions. Renaissance is MLA’s programme for thetransformation of England’s regional museums. It is the first central government investment of its kind formuseums, and presents a structure through which a co-ordinated oVer amongst regional museums could bedeveloped and resources directed to support. It is therefore crucial that this existing investment is sustained,particularly as we enter a tight funding round in the 2007 comprehensive spending review. Now is the timeto build on and develop this successful programme, which if cut will severely curtail the capacity of themuseums sector to support and deliver the Cultural Olympiad and develop the UK tourism oVer. It shouldbe noted there is no national funding programme for archives, and whilst Framework for the Future oVersa programme for public libraries its focus is on improving the library service and the repositioning of publiclibraries, and does not fund organisations directly 0.448851
29 2306 10 Memorandum_submitted_by_the_Olympic_Lottery_Distributor_-_Jan_2007.pdf OLD only has the money which is made available by Lottery players. While it is not for us to sell Lottery tickets we have an interest in people continuing to play the Lottery. We see our role in helping make sure Lottery players know where their money has gone and we are in discussion on this complex issue with the ODA and LOCOG. More broadly we see ourselves as having a duty to safeguard the reputation of the Lottery insofar as it is supporting the Olympic vision behind which much of the nation united last year. We see our role of playing an active role in ensuring that Lottery money delivers Olympic objectives as helping in this respect. We are also anxious that the London Olympics are seen as another reason to play the Lottery and that the prospect of the Games will help grow the Lottery's income. We are part of the wider Lottery family and we are not indifferent to concerns that the funding requirements of the Olympics will deprive the other Good Causes and we know that damage to the reputation of the Lottery may make us all poorer. We hope therefore that the Olympics can be used as a high profile positive message in Lottery marketing 0.446228

5.3.4. N most prototypical paragraphs of each topic

For each topic, print the N paragraphs that have the highest probability of belonging to that topic.
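The selection step can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: `top_n_per_topic` is a hypothetical helper and not necessarily how `top_n_filter` is implemented in this program. The column names `topic_id` and `probability` are taken from the output table, and the toy values are made up.

```python
import pandas as pd

def top_n_per_topic(df, n, topic_col="topic_id", prob_col="probability"):
    """Hypothetical helper: keep the n highest-probability rows of each topic."""
    # Order by topic, then by descending probability within each topic
    ranked = df.sort_values([topic_col, prob_col], ascending=[True, False])
    # head(n) on the grouped frame keeps the first n rows of every group
    return ranked.groupby(topic_col, sort=False).head(n).reset_index(drop=True)

# Toy data standing in for df_topic_para2 (values are invented for illustration)
toy = pd.DataFrame({
    "topic_id": [1, 1, 1, 2, 2],
    "paragraph": ["a", "b", "c", "d", "e"],
    "probability": [0.40, 0.62, 0.55, 0.30, 0.51],
})

top2 = top_n_per_topic(toy, 2)
print(top2)
```

Sorting before grouping means each topic's rows come out already ranked, so the most prototypical paragraph of a topic always appears first in its block.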

In [49]:
# Show the 2 most prototypical paragraphs of each topic
N2 = 2
# Style the 'paragraph' column with a fixed width so long texts stay readable
top_n_filter(df_topic_para2, N2).style.set_properties(subset=['paragraph'], **{'width': '500px', 'length': '50px'})
Out[49]:
Index topic_id file paragraph probability
0 9932 1 Qs_202_-_238.pdf Lord Moynihan: Adrian, I am looking to the future and in this new found position of being apolitical and outside criticism of what may or may not have happened in the past I can tell you what is required moving forward is significant improvements rather than dwelling on what went wrong in the past. The first thing that is required is consistent funding. It is absolutely essential that we have consistent funding to the Olympic governing bodies. The second thing that is required is hiring the best coaches. I mentioned earlier that coaches are centres of excellence, and they really are. The coach inspires, the coach motivates, the coach can get the best out of our young Olympians of the future. We will deliver far better than we have in the past. The importance given to coaching needs to be on a step change level from where it has been in the past. I am very pleased to say that UK Sport and the Government are at one with us on the importance of reinforcing more emphasis on coaching. Then I mentioned in a highly competitive market, which it will be, to achieve fourth place we need to make sure not just the coaching but the management, the administration, the medical, the support services to our sports men and women are there and that we have the back-up facilities, the high performance centres properly resourced and that the governing bodies are given the support they need to make the decisions on behalf of their staff, their performance directors, their coaches. Ultimately we are all the servants of the sports men and women who will be up there winning medals in 2012. I am absolutely convinced that with the strength of the governing bodies and the talent we have in this country that is an achievable and deliverable target. I am absolutely convinced that unless Clearing the Bar is accepted we will not achieve that stretch target. 
That is why we are spending so many hours, the days and nights and little time we have available getting that document right and working with UK Sport to make sure we get that right. We have to do it now, we cannot afford to wait until the end of next year. We need to make sure that the funding is secure so that governing bodies can take the steps which are necessary next year. The time invested in the beginning of next year will be worth a huge amount when we get to 2012. It is time invested upfront to make sure that those plans can deliver the success. That means we will look at Beijing in a different light from how we would normally look at Beijing. We will look at Beijing as a stepping stone towards success in 2012 which may not best be judged by medal tallies, for example, but must be judged by how much progress the squads, the teams and the coaches have made en route from here to 2012; which will be our principal goal 0.615441
1 3639 1 NAO_Preparing_for_sporting_success_-_March_2008.pdf 4 UK Sport’s ‘ultimate goals’ for medal success at the London 2012 Games will require a step change in performance amongst elite athletes. The achievements of athletes at recent elite international events in a number of sports, including sailing, cycling, rowing, boxing, disability equestrian and disability shooting, suggest that performance levels in some sports are already improving significantly. Following increased spending on elite sport, host nations can typically expect to win an extra six or seven gold medals at an Olympic Games and to win medals across a wider range of sports. This ‘host nation effect’ would not in itself be enough to deliver UK Sport’s Olympic goal, which is likely to require an improvement of eight or nine gold medals over the Great Britain team’s performance at the Athens Games in 2004 if the relative performance of other nations remained the same. Changes in the performance of other nations since 2004, especially in the context of a general trend of increased spending on elite sport, sometimes referred to as a ‘global sporting arms race’, may also have implications for UK Sport in delivering its medal aspirations 0.556173
2 17552 2 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 4. Most clubs seem ill-prepared for enquiries from, and inclusion of, people with disabilities who wish to participate in the sport offered by the club. There is little or no support for specialist clubs who provide opportunities for sport that cannot be integrated ie wheelchair basketball or blind cricket! There is often nothing locally to support the child or their parent in accessing the specialist provision and this often involves their having to undertake extensive travel to specialist facilities or organisations catering for this group. There is poor information regarding availability etc and http://www.parasport.org.uk was established as a portal to provide pathway and provision information. The then Mayor of London published a strategy in 2007, which highlighted all these issues and to date there has been little, if any, action to redress these anomalies in London or in the rest of the country. DCSF have, in my view, shown no leadership regarding the legacy of 2012 and its impact on PE in schools and the inclusion of those with disabilities in core curriculum activities or sporting opportunities within or after school. DCMS held a legacy event in 2008 and again in April 2009 focusing on the legacy of the Games for those with disabilities. One outcome was to seek greater links between DWP, DCSF and themselves to ensure that joint strategies were developed and pathways established that enabled children to enjoy and participate in PE and sport within schools/after school clubs/integrated and specialist provision in the community, with good national talent forums and pathways established for those who wish to participate in sport at a higher level and finally, with governing bodies having clear inclusive programmes for sports men and women with disabilities active at a national and international level 0.676578
3 17551 2 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 3. After school activities often exclude this group of children ie no possibility of inclusion in team sports such as football, cricket, rugby, basket ball etc., limited access to swimming baths, athletic fields etc. All of these sports are undertaken by people with a disability, but not now normally at school. Specialist schools did provide a massive range of sporting opportunities and sport played a major part of my own adjustment to disability. I learned about teamwork, was able to set individual goals, have competition to extend my abilities, occasionally experienced "being a winner" and had that thrill of competition. One example of efforts to redress this issue within a school setting is in Leeds, which has a programme of monthly sporting and physical activities arranged in school time for vision impaired children within the schools. One solution easily achieved would be for groups of schools to come together monthly to provide sporting and recreational activities for disabled children and young people within their schools 0.627952
4 2929 3 NAO_Preparations_for_the_Olympics_-_Progress_report_-_Feb_2010.pdf Figure 2The Delivery Authority’s November 2007 baseline for the Village and Media Centre and its December 2009 forecastall figures in £ million, inclusive of Vat november 2007 baseline Revised budget approved by funders in 2009latest forecast of potential costs december 2009Village and Stratford City land and infrastructure Land and infrastructure – public sector funding 522 553 560Estimated share of profits from post Games development (250) (250) (100)1Net Olympic Delivery Authority budget 272 303 460Village development Village construction and sales (gross) costs 0 1,095 1,1262Agreed sale of 1,379 units to Triathlon Homes 0 (268) (268)Estimated receipts from private sales – to fund construction 0 (177) (177)Village development – sub total 0 650 681Estimated receipts from private sales – to repay funders 0 (324) (324)Net Olympic Delivery Authority budget 0 326 357Media Centre3Media Centre 220 355 334Net Olympic Delivery Authority budget 220 355 334Total net cost to the Olympic Delivery Authority 492 984 1,151Source: National Audit Offi ce noteS1 See paragraph 4.10 0.602559
5 7679 3 PAC_-_Risk_assessment_and_management_-_July_2007_report.pdf 6. At the time of our hearing, a number of significant areas of uncertainty remained before the budget could be finalised.18 Tax – The Department explained that tax had been excluded from the cost estimates at the time of the bid on the basis that the tax treatment could not be resolved until the delivery structures for the Games were in place.19 Contingency provision – The cost estimates at the time of the bid had included a contingency provision in respect of individual projects, but the Department now considered that an overall programme contingency margin was required to reflect the interdependencies between projects and the risks associated with the knock-on effect that problems on one project could have on the rest of the programme.20 Security – It had not been possible at the time of the bid to provide a reliable estimate of the costs of policing and wider security, and the Department had notified Parliament of a contingent liability in this respect. The Olympic Security Co-ordinator was now working up plans and budgets in association with the Home Office and the Metropolitan Police.21 Private sector funding – At the time of the bid, £738 million of private funding had been expected to help meet costs associated with the Olympic Park. In the light of further work and advice, the Department now considered there was insufficient time to negotiate contracts with the private sector within the overall timescale for the Games, so there was now little prospect of securing significant private sector funding to deliver the Olympic Park. However, most of the funding for the Olympic Village was still expected to come from the private sector.22 0.596071
6 2002 4 Memorandum_submitted_by_LOCOG_-_Nov_2007.pdf Chief Medical Officer Creative Director Director of Strategy and Programme Management Head of Procurement Head of Client Services Head of Education Head of Venues Technology Head of Programme Solutions Head of Workforce Planning Head of Accommodation Head of Sport Policy Head of Culture Head of Telecommunications Head of Ceremonies Head of Administrative IT Head of Live Site Head of Venue Management Head of Ticketing — We now have in place the core components required to undertake the detailed planning for the Games. The team however remains small, at just under 200 people and our recruitment is planned carefully on the basis of the core work that needs to be undertaken now. 0.553650
7 17005 4 Written_evidence_submitted_by_Olympic_Park_Legacy_Company_-_Feb_2010.pdf 10. The Olympic Park Legacy Company's strategic objectives, as set by its Founders cover: — assisting the Government and the Mayor of London in fulfilling some of the legacy promises made in the bid to host the London 2012 Olympic and Paralympic Games; — securing the timely development of the Olympic Park site as a high quality and sustainable mixed community; — promoting social, economic and environmental benefits for local communities; — securing the long term development and management of the Olympic Park site and venues in ways which provide lasting national and local sporting, cultural, education and leisure benefits and which preserve the site's Olympic heritage; — working with partners to contribute to long-term economic growth and prosperity in the wider area; — levering in private investment to maximise and provide best value for the public purse; and — promoting sustainable development, community involvement and equality of opportunity 0.490745
8 16483 5 Written_evidence_submitted_by_Host_Boroughs_Unit_-_Feb_2010.pdf 4.1 The aim is that in the next 20 years, residents in the host boroughs will equal the London average in a range of the life indicators which you would expect to find in a successful community: — employment rates will increase to the London average; — average incomes in the bottom two fifths of earners in the host borough area will be increased to the London average; — young people in the host borough area will have improved GCSE results to at least the London average; — host borough 11 year olds will have at least the same educational attainment as the London average; — the number of families in receipt of benefits in the host boroughs area will fall to no more than the London average; — the rate of violent crime will continue to fall and reflect the London average; and — residents in the host boroughs area, particularly men, will have increased life expectancy to the London average 0.566613
9 4947 5 Oral_evidence_-_17_March_2010_Qs_100-143_-_Boroughs.pdf Sir Robin Wales: It is worth making a comment here. It has been very interesting watching this because Jules has led very much with a vision for this place, which the five boroughs have supported but Jules has driven that vision. It is as we have gone on and people have begun to connect with the SRF and understand what we are trying to do that people have begun to realise we are now with the OPLC, which I think is people understanding that legacy is important and beginning to line up behind the vision that Jules has pushed extensively because he understands the nature of his community and how that might work and how it will relate. I think it is a really good example of something being pushed by a local borough, backed by the rest of us, looking to have a vision that will make a difference there and will link in with the community he has got. It comes back to heroic economic assumptions. What comes out at the end will come out, but at least we are trying to do something that will deliver, something that will work for the local area, based on the vision that we have had locally and people are now beginning to line up to. So the question now is: do we get people lined up to support us on public policy and then the jobs that will come out will be the jobs that come out and they will begin to make a difference, particularly to Hackney but also to some of the people in Newham who will be able to access that, and other boroughs. It is a really good example of how the vision has been led by boroughs and people are now getting it 0.536632
10 3329 6 NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf 6 PREPARATIONS FOR THE LONDON 2012 OLyMPIC AND PARALyMPIC GAMES: PROGRESS REPORT JuNE 2008 10 The start and completion dates for the construction of the main venue and infrastructure projects delivered by the Olympic Delivery Authority at the end of March 2008 compared with the milestones in the November 2007 Programme Baseline ReportProjectEnabling Works (site preparation) Power Lines under Grounding (switchover only) Structures, Bridges and Highways utilities Main Stadium Aquatics Centre VeloparkHandball/Indoor Sports ArenaBasketballInternational Broadcast Centre/ Main Press CentreOlympic Village Eton Manor (training facilities and Paralympic events) Broxbourne (white water canoeing) Eton Dorney (rowing) Weymouth and Portland (sailing)construction start date November 2007 March 2008 Change in programme Forecast start date baseline (months)October 2006 October 2006 0 July 2008 July 2008 0 April 2008 April 2008 0 January 2008 January 2008 0 July 2008 May 2008 –21 September 2008 September 2008 0 March 2009 March 2009 0August 2009 June 2009 –2July 2009 November 2009 4May 2009 March 2009 –2 June 2008 May 2008 –1 March 2010 January 2010 –2 August 2008 May 2009 9 March 2009 January 2009 –2 May 2008 January 2008 –4construction end date November 2007 March 2008 Change in programme Forecast end date baseline (months)September 2009 September 2009 0 September 2008 November 2008 2 December 2011 December 2011 0 December 2011 August 2011 –4 Construction Construction end date end dateFebruary 2011 April 2011 2Completion date Completion date for construction for construction and initial overlay and initial overlay for test events for test eventsJune 2011 June 2011 0Construction Construction end date end dateApril 2011 August 2011 4Completion date Completion date for construction for construction and initial overlay and initial overlay for test events for test eventsJuly 2011 August 2011 1April 2011 
February 2011 –2April 2011 March 2011 –1April 2011 April 2011 0June 2011 July 2011 1 December 2011 December 2011 0 February 2012 April 2011 –10 June 2010 October 2010 4 April 2010 July 2009 –9 February 2009 January 2009 –1Source: National Audit Office examination of actual and forecast progress against the November 2007 Programme BaselineNOTE 0.623241
11 3335 6 NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf Construction of the Main Stadium and its readiness for test events is critical to the Authority’s delivery programme. At the time of the November 2007 Programme Baseline Report, which was before contract signature, construction of the Main Stadium was scheduled to be completed in February 2011, to be followed by LOCOG overlay works for test events which were to be completed by June 2011. To date good progress has been made on site preparation, which has allowed construction to start in May 2008, two months earlier than planned.1 As a result of contract negotiations the Authority has agreed a longer construction period than anticipated (35 months rather than 32). The Authority has, however, built early access dates for LOCOG overlay works into the contract so overlay can be carried out in parallel with the construction work. The forecast date for readiness for test events remains at June 2011 0.619893
12 8304 7 PAC_26_April_2012_-_Olympic_Costs_-_corrected_evidence_(no_report).pdf You think that the assessment of risks is our best estimate of the most likely outcome of the budget as a whole. But actually the assessment of risk-and how we have compiled it-is this: we have not sought to estimate how likely it is that every risk arises. We just said, "Let us think about every risk that could arise, and let us assume that they all arise and work out the likely cost of them all arising." On top of that, we said, "And there will be some risks that we just cannot think about that are unknown unknowns. There will be some multiple consequentials if everything came together." So we end up with an estimate not of the most likely cost of the project, which is what the burden of paragraph 1 of the PAC Report understands it is, but an estimate of how much we would need to set aside in the very unlikely event that all risks arise and some more unknown risks arise as well. The purpose of that is not to get to an estimate of the likely outcome of the budget. Its purpose and why we do it is to see, against any reasonable view of the likely risk that might arise, even on an assumption that they all arise and some more unknown risks arise, whether we have enough money. The conclusion has always been, yes, we had. Against what is therefore, in my view, a conservative and prudent estimate, we had £36 million headroom at the time of the NAO Report. We had more, and indeed the picture over the six-month period since the original figures on which the NAO was recording this, is that the contingency has gone down by £27 million or so-we reckon, because these are provisional figures, but I want to give our best figures-and the assessed risks on that very conservative and prudent basis have gone down by £136 million. So the picture on the budget as a whole is that we are spending contingency significantly slower than risks are disappearing from the programme. 
That is why, without in any sense being complacent, I am confident that we will bring this in within budget, and I do not think that the budget is close to being used up 0.605655
13 8388 7 PAC_26_April_2012_-_Olympic_Costs_-_corrected_evidence_(no_report).pdf Jonathan Stephens: There are two aspects to risk: likelihood and impact. What we are saying is that we made no estimate of likelihood, we just wrote in a 100% likelihood of all the risks we could think of and some unknown risks that we could not think of. We then looked at impact, and on impact we said, "If this risk were to materialise-we are assuming a 100% likelihood that it materialises-what is the likely cost?" That is where you get the low, the most likely outcome and the high outcome. When you add those together, you do not get to an outcome of, "What is the most likely expenditure on the programme?"; you get to an outcome of, "If all conceivable risks arise, plus some unknown risks that we cannot identify, what is the likely expenditure?" That is a conservative and prudent view of, "Do we have enough contingency left, if all those risks arise?" In practice, they won’t all arise. It is conceivable that some will arise, but it is pretty unlikely that all of them will arise. It is perfectly conceivable that some individual risk will arise at a higher estimate than the most likely estimate, but the prospect of all those risks arising is unlikely. The prospect of them all arising at the very highest possible cost is so unlikely as to not provide a good basis for planning. I am sorry. I am going on at some length, but there is a real point 0.570251
14 4654 8 Oral_evidence_-_15_Dec_2009_LOCOG_ODA.pdf Mr Deighton: I think the first thing is in the next one and a half years or so anybody presenting themselves as selling new tickets we want to make quite clear to the general public that cannot be possible, because tickets only go on sale in 2011. Any attempt by anybody—and this is something I think we would like the Committee's help on—if you are hearing about this in your constituencies where they think they are being offered tickets, let us know; we are trying to be as public as possible saying, "Anything that happens pre-2011 before our formal ticket launch can't be possible". Anyone trying to do that is effectively trying to take advantage of the public. I think that is the first thing. Secondly, of course, with Olympic tickets, like Premier League tickets, the law is very, very clear; they cannot be resold at a profit. We will be working with the police to follow up on any instances where we think that is happening; whether it is tracking it down on the internet, or whether it is happening in practice. I think the other thing that is really important, as I suggested earlier, we are very keen in our business plan for ticketing to make sure that we get those tickets initially into the hands of the people who most want them, so the opportunity to create this kind of secondary market is as limited as possible. That is why we are really building a very carefully constructed book for each sport, so once those guys have got that ticket it is not going to come back onto the market. If they do want to resell the ticket, we have a plan to develop a London 2012 ticket exchange so they will be able to resell it through us, so they will have no excuse. For whatever reason they cannot go, we will be able to resell it for them; not at a profit but we will be able to take care of their original purchase price. 
That again is a small facility which should constrain the supply of potential tickets that could represent a touting risk if not controlled 0.593694
15 4644 8 Oral_evidence_-_15_Dec_2009_LOCOG_ODA.pdf Indeed, if you are trying to sell 9.2 million tickets to such a range of sports, a number of which are not that well known in the UK, the only way we are going to get full stadia, full of enthusiastic fans, is to make them highly affordable, and that is our objective. It is our objective; it is certainly the objective of the Olympics Minister; it is certainly the objective of the Mayor of London; and those stakeholders are actually defining our final pricing strategy over the months ahead. The only reason it is difficult at this point to talk in such specific terms, as we have with our bid promise, is that of course back then the Games actually had a different portfolio of sports: we had baseball and softball, where you would have had 700,000 tickets at the very low end of the pricing. We just want to make sure we have got a very precise grip of the supply side of the tickets. What we have been working on is our competition schedule, so we know exactly what sport is going to be when; what is in the morning; what is in the afternoon; how many sessions we have. Just last week the IOC confirmed for example the format of the cycling competitions—where they redistributed some events towards the sprints and equalising events between men and women. You have to know what events you are putting on before you know what tickets you have got to sell. We are also looking very specifically at the seating bowl, so again I know how many seats I have to sell. We are working with the broadcasters, for example, on camera positions. If you have lots of camera positions you have fewer seats left to sell. All that work, on the schedule, the seats, is what you need to know for how many you have got; and then on the demand side, we are building up a sport-by-sport plan for who the fans are, and who is going to come and watch these sports. 
It is a very different proposition to get somebody to come to the final of the 100 metres in the main stadium, compared to someone to come, as I mentioned earlier, to the preliminary rounds of the handball competition where handball has not been regularly played in the United Kingdom. We need a sport-by-sport analysis of where that demand is 0.540326
16 454 9 Jan_2003_-_Qs_200-220.pdf The Committee suspended from 4.09 pm to 4.23 pm for a division in the House Alan Keen 202. I did not get to the end of the question at the beginning but the point I am making is that because we have to have a village and all the events have to be in that area it adds costs to hosting the Olympics, I reckon at least half a billion and probably a billion. If we could spread them round the country—and I went to Japan for the World Cup and the atmosphere was brilliant. We went to different places—more people could get to see it. If we could do that with the Olympics, the point I am really asking you is that it is difficult for the Government. The Minister and the Secretary of State are going to see the President of the IOC on Friday. It will not do our bid any good if they go there telling them how they should organise the Olympic Games in the future. I am really asking you as the main channellers of funding in sport in this country, will you make these representations that the Olympics, just for the sake of having 18,000 athletes in one village, which is very nice, although it is not so nice for those whose event comes on the last day and they want a party—we could save somewhere between half a billion and a billion pounds by using facilities we have got around the country now. The athletics could be at Wembley as they were supposed to be. The football could be at the main stadium and spread around the country as it is going to be in fact. What I am saying is that instead of having the athletes all together in one village for the three weeks of the Olympics, we could put a party on for them and they could stay for a week after the Olympics when they could all get drunk if that is what they do. I think somebody needs to go to the IOC and put this point to them. We have been taking evidence from people in the last couple of days and there are tremendous difficulties. 
There would hardly be a difficulty if we could use stadia around the country and we did not have to have the village. It is the village that causes all the problems that we are facing now 0.621515
17 466 9 Jan_2003_-_Qs_200-220.pdf 206. But this has only just come up, has it not? In all our enquiries into Wembley Stadium lasting over years the problem of the location of the village in relation to the stadium has never ever come up. Now we are told that we need a new stadium in East London that is likely to be surplus to requirements as soon as the Olympics are over, should we get them, because of its juxtaposition to a village whose location has not been decided upon anyhow. I realise that distances in Manchester are not the same as distances in London but the village in Manchester was some considerable distance away from the stadium in Manchester and because, among other things, the structure created by Mr McCartney included an excellent transport system, there was no problem in getting there. I lament that we are having to put these questions to you but this was not a relevant issue yesterday when we had the IOC and the BOA in front of us. Why at this stage has this whole issue of the juxtaposition of the village to the stadium become critical to the likelihood of a successful bid and why have we only just heard about it now? Why, after all of these years, ever since 1996 when Sport England handed over £120 million for Wembley, is it, seven years later, that this has suddenly surfaced as a problem 0.586045
18 10389 10 Report_and_Minutes_-_Jan_2007.pdf 19. There was much more optimism about scope for increasing tourist traffic after the Games. Mr Castle, the East of England representative on the Nations and Regions Group, described the Games as a “shop window” for the UK. Both he and the Tourism Management Institute saw scope for the Games to generate business tourism.254 The DCMS memorandum stated that “experience from recent host cities indicates that tourism will increase significantly across the UK, most notably after the Games”;255 and the Tourism Alliance told us that DCMS expected that up to 80% of the legacy benefit to be derived from hosting the Games would be gained through “increased tourism as a result of [the] high degree of international media exposure”.256 The Tourism Alliance itself agreed that the main way that lasting benefits would be reaped would be through media exposure; but it saw Government investment in a tourism strategy as being a necessary part of drawing on that exposure; and it spoke of a “lack of realisation within DCMS that additional funds need to be committed … to marketing and media support”. The Government has pledged that the interests of tourism “will be taken into account in all Olympic policy decisions”; underlying this pledge, however, was a statement by the Secretary of State that, in order to increase the number of visitors as a result of the Games, the tourism industry needed “to improve the consistency of its quality, raise the level of skill and, through imaginative marketing, showcase Britain’s heritage and its dynamic, 21st century cities”.257 0.475559
19 11848 10 Report_and_Minutes_-_Jan_2007_-_vol_2_-_evidence.pdf 2.2 We welcome the opportunity to support the Games and are already investing in some of the keyinstitutions which will deliver a high quality Games. Institutions such as Birmingham Museum and ArtGallery, Bristol Museum and Art Gallery, Tyne and Wear Museums and the Museum of London have allreceived funding through Renaissance in the Regions. Renaissance is MLA’s programme for thetransformation of England’s regional museums. It is the first central government investment of its kind formuseums, and presents a structure through which a co-ordinated oVer amongst regional museums could bedeveloped and resources directed to support. It is therefore crucial that this existing investment is sustained,particularly as we enter a tight funding round in the 2007 comprehensive spending review. Now is the timeto build on and develop this successful programme, which if cut will severely curtail the capacity of themuseums sector to support and deliver the Cultural Olympiad and develop the UK tourism oVer. It shouldbe noted there is no national funding programme for archives, and whilst Framework for the Future oVersa programme for public libraries its focus is on improving the library service and the repositioning of publiclibraries, and does not fund organisations directly 0.448851

5.3.5. N most prototypical paragraphs of a specific topic

Select a topic and print the N paragraphs with the highest probability of belonging to that topic.

In [50]:
topic_id_chosen = 2                                     # choose the topic ID
num_para = 2                                            # set N to extract the N most prototypical paragraphs of a specific topic
df_n_topic_k = top_n_filter(df_topic_para2, num_para)
topic_id_filter = df_n_topic_k['topic_id'] == topic_id_chosen
df_n_topic_k[topic_id_filter].style.set_properties(subset = ['paragraph'], **{'width':'500px', 'length': '50px'})
Out[50]:
Index topic_id file paragraph probability
2 17552 2 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 4. Most clubs seem ill-prepared for enquiries from, and inclusion of, people with disabilities who wish to participate in the sport offered by the club. There is little or no support for specialist clubs who provide opportunities for sport that cannot be integrated ie wheelchair basketball or blind cricket! There is often nothing locally to support the child or their parent in accessing the specialist provision and this often involves their having to undertake extensive travel to specialist facilities or organisations catering for this group. There is poor information regarding availability etc and http://www.parasport.org.uk was established as a portal to provide pathway and provision information. The then Mayor of London published a strategy in 2007, which highlighted all these issues and to date there has been little, if any, action to redress these anomalies in London or in the rest of the country. DCSF have, in my view, shown no leadership regarding the legacy of 2012 and its impact on PE in schools and the inclusion of those with disabilities in core curriculum activities or sporting opportunities within or after school. DCMS held a legacy event in 2008 and again in April 2009 focusing on the legacy of the Games for those with disabilities. One outcome was to seek greater links between DWP, DCSF and themselves to ensure that joint strategies were developed and pathways established that enabled children to enjoy and participate in PE and sport within schools/after school clubs/integrated and specialist provision in the community, with good national talent forums and pathways established for those who wish to participate in sport at a higher level and finally, with governing bodies having clear inclusive programmes for sports men and women with disabilities active at a national and international level 0.676578
3 17551 2 Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf 3. After school activities often exclude this group of children ie no possibility of inclusion in team sports such as football, cricket, rugby, basket ball etc., limited access to swimming baths, athletic fields etc. All of these sports are undertaken by people with a disability, but not now normally at school. Specialist schools did provide a massive range of sporting opportunities and sport played a major part of my own adjustment to disability. I learned about teamwork, was able to set individual goals, have competition to extend my abilities, occasionally experienced "being a winner" and had that thrill of competition. One example of efforts to redress this issue within a school setting is in Leeds, which has a programme of monthly sporting and physical activities arranged in school time for vision impaired children within the schools. One solution easily achieved would be for groups of schools to come together monthly to provide sporting and recreational activities for disabled children and young people within their schools 0.627952

5.4 Topic Dashboard

Below, the PyLDAvis visualisation and the prototypical paragraphs are integrated into a dashboard. Running the chunks prints a link; click it to open the dashboard and explore the topics more easily. To launch the dashboard, remember to download the two css files from https://github.com/suhao3123/CSS, create a folder named assets in the root of your app directory and put the two files in that folder. After the first run of the whole program, users can run the chunks below independently.
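
Before launching, it can help to confirm that the assets folder is in place. The sketch below is a minimal check, assuming the app is started from the notebook's working directory; missing_assets is a hypothetical helper, not part of the program. The two names it looks for, bootstrap.min.css and lda.html, are the ones the dashboard code refers to; the name of the second css file is not stated here, so it is not checked.

```python
from pathlib import Path

def missing_assets(root=".", required=("bootstrap.min.css", "lda.html")):
    """Return the required file names that are absent from <root>/assets."""
    assets_dir = Path(root) / "assets"
    return [name for name in required if not (assets_dir / name).is_file()]

# An empty list means every expected file was found in ./assets
print(missing_assets())
```

If the list is not empty, add the listed files to the assets folder before running the chunks below.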

In [58]:
import pandas as pd
import numpy as np
import plotly.express as px 
import plotly.graph_objects as go

from jupyter_dash import JupyterDash

import dash
import dash_table
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash_table.Format import Format, Scheme, Trim
from dash.dependencies import Input, Output, State
from dash.exceptions import PreventUpdate
In [59]:
# load the topic distribution of paragraphs from disk
df_topic_para3 = pd.read_pickle('./df_topic_para_Olympics2.pkl')
df_topic_para3_n = df_topic_para3.copy()
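# the topic probabilities start at column position 6; fix the selection before any new column is added
topic_cols = df_topic_para3_n.columns[6:]
df_topic_para3_n['highest_p'] = df_topic_para3_n[topic_cols].max(axis = 1)         # probability of the most salient topic
df_topic_para3_n['salient_topic'] = df_topic_para3_n[topic_cols].idxmax(axis = 1)  # id of the most salient topic
df_topic_para3_n = df_topic_para3_n[['index','file_name','salient_topic','paragraphs','highest_p']]
df_topic_para3_n.columns = ['Index','file','topic', 'paragraph','probability']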

# define the function for extracting the N highest-ranked paragraphs from each topic
# (k, the number of topics, is a global defined earlier in the program; the topic columns are labelled 1..k)
def top_n_filter(df, top_n):
    list_topic_id = [x+1 for x in range(0,k)]
    list_n_para = []
    list_n_p = []
    list_n_index = []
    list_n_file_name = []
    for x in range(1, k + 1):
        top = df.nlargest(top_n, [x])           # rank the paragraphs once per topic
        list_n_para.append(list(top['paragraphs']))
        list_n_p.append(list(top[x]))
        list_n_file_name.append(list(top['file_name']))
        list_n_index.append(list(top.index))    # index of the frame passed in, not of the global frame
    pd_n_para = pd.DataFrame({'Index':list_n_index, 'topic_id': list_topic_id, 'file': list_n_file_name, 'paragraph': list_n_para, 'probability': list_n_p})
    # one row per (topic, paragraph) pair
    return pd_n_para.apply(pd.Series.explode).reset_index(drop = True)

list_mark = list(np.arange(0,1.050,0.050))
list_mark_round = [round(i, 2) for i in list_mark]
marks= {x: str(x) for x in list_mark_round}

# Set up the app
external_stylesheets = [dbc.themes.BOOTSTRAP, "assets/bootstrap.min.css"]
app = JupyterDash(__name__, external_stylesheets=external_stylesheets)

# Bootstrap's cards provide a flexible content container with multiple variants and options.
pyLDAcard = dbc.Card(
    [
            dbc.CardHeader(html.H4("Topic visualisation")),                # title
            dbc.CardBody(
            [
                dbc.Row(
                        dbc.Col(
                            [
                                html.Embed(src = "assets/lda.html" ,style={ 'position': 'relative', 'left': '-250px', 'top': '-100px',
                                                                            'width':'1400px', 'height':'860px', 'transform': 'scale(0.70)'}), 
                            ]
                        )
                )
            ]
        ),
    ]
)


table_card = dbc.Card(
    [
        dbc.CardHeader(
            dbc.Row([
                  dbc.Col(html.H4("Prototypical paragraphs"))
            ])            
        ),
        
        
        # controls: probability threshold, topic ID, number of paragraphs and mode
        dbc.CardHeader(
            dbc.Row(
                [
                    dbc.Col(
                        [
                            html.H6("Threshold of probability "),
                            dcc.Slider(
                                id='slider',
                                min=0,
                                max=1,
                                step=0.01,
                                marks=marks,
                                value=0.05,
                            ),
                            html.Div(style={'width': '1000px'}),
                        ]
                    ),
                    dbc.Col(
                        [
                            html.H6("Topic ID"),
                            dcc.Input(id="topic_selection", type="number", min=1, max=100, step=1, value=1),
                            html.Div(style={'width': '100px'}),
                        ]
                    ),
                    dbc.Col(
                        [
                            html.H6("Number of paragraphs"),
                            dcc.Input(id="rank_selection", type="number", min=1, max=1000, step=1, value=5),
                            html.Div(style={'width': '100px'}),
                        ]
                    ),
                    dbc.Col(
                        [
                            html.H6("Mode"),
                            dcc.Dropdown(
                                id='dropdown',
                                options=[
                                    {'label': 'N most prototypical paragraphs for topic K', 'value': 'c1'},
                                    {'label': 'N most prototypical paragraphs overall', 'value': 'c2'},
                                    {'label': 'N most prototypical paragraphs for each topic', 'value': 'c3'},
                                ],
                                # value = 'c1',
                                searchable=False,
                                clearable=False,
                                placeholder="Select a mode",
                            ),
                            html.Div(style={'width': '380px'}),
                        ]
                    ),
                ]
            )
        ),
        
        dbc.CardBody(
                dbc.Col([
                    dash_table.DataTable(),html.Div(id="data_table")           
                ])    
                ),
        
        dbc.CardFooter(
            dbc.Row([
                dbc.Col(
                                    [
                                        html.H6('Please click the "Submit" button after setting the parameters above'),html.Div(style={'width': '500px'})

                                    ]
                                ),
                                
                dbc.Col(
                                    [
                                        dbc.Button("Submit", id='submit', color="success"),
                                        html.Div(id='button')
                                    ]
                                )
                ])
                )       
    ]
)
        
app.layout = html.Div(
    [
        dbc.Container(
            [dbc.Row(
                [
                dbc.Col(pyLDAcard,md=7), 
                dbc.Col(table_card,md=5)
            ]             
            )
            ],
            fluid=True,
        ),
    ]
)

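# only the Submit button triggers the callback; the other controls are read as State when it is clicked
@app.callback(
    Output('data_table','children'),
    Input('submit', 'n_clicks'),
    State('dropdown', 'value'), State('slider', 'value'), State('topic_selection','value'), State('rank_selection','value')
     )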

def update_datatable(n_clicks, dropdown_value, slider_value,topic_value,top_n):
  
    ctx = dash.callback_context
    if not ctx.triggered:
        button_id = 'No clicks'
    else:
        button_id = ctx.triggered[0]['prop_id'].split('.')[0]
    
#    print(button_id)
                 
    if button_id=="submit":
        topic = topic_value                   # topic ID used to filter the highest-ranked paragraphs
        Top_N = top_n                         # number of highest-ranked paragraphs to keep
#        print(topic_value)
#        print(Top_N)   
        
        minimum_probability = slider_value    # paragraphs with a probability below this threshold are discarded
#        print(minimum_probability)            

        if dropdown_value=='c1':
            # N most prototypical paragraphs for the selected topic
            c_df = top_n_filter(df_topic_para3, Top_N)
            c_df = c_df[(c_df['topic_id'] == topic) & (c_df['probability'] >= minimum_probability)]
        elif dropdown_value=='c2':
            # N most prototypical paragraphs overall
            c_df = df_topic_para3_n.nlargest(Top_N,['probability'])
            c_df = c_df[c_df['probability'] >= minimum_probability]
        elif dropdown_value=='c3':
            # N most prototypical paragraphs for each topic
            c_df = top_n_filter(df_topic_para3, Top_N)
            c_df = c_df[c_df['probability'] >= minimum_probability]
        else:
            return None
#        print(dropdown_value)
        
        table = dash_table.DataTable(
                                    id="table-line-1",
                                    columns=[
                                                dict(id=c_df.columns[0], name=c_df.columns[0]),
                                                dict(id=c_df.columns[1], name=c_df.columns[1]),
                                                dict(id=c_df.columns[2], name=c_df.columns[2]),
                                                dict(id=c_df.columns[3], name=c_df.columns[3]),
                                                dict(id=c_df.columns[4], name=c_df.columns[4], type='numeric', format=Format(precision=2, scheme=Scheme.fixed)),             
                                            ],
                                    data=c_df.to_dict("records"),
                       #             page_action='none',
                                    page_size=5,
                                    style_table={'height': '1000px', 'overflowY': 'auto'},
                                    fixed_rows={'headers': True},
                                    style_header={ 'border': '1px solid black', 'fontWeight': 'bold','textAlign': 'center', 'fontSize':'1px'},
                                    style_cell={  'fontSize':'10px','border': '1px solid grey','minWidth': 10, 'width': 30,'whiteSpace': 'normal',
                                                'height': 'auto', 'lineHeight': '15px','textAlign': 'center','textOverflow': 'ellipsis', 'maxWidth': 0},
                                    css=[{
                                            'selector': '.dash-spreadsheet td div',
                                            'rule': '''
                                                line-height: 15px;
                                                max-height: 300px; min-height: 50px; height: 300px;
                                                display: block;
                                                overflow-y: hidden;
                                            '''
                                        }],
                                     style_cell_conditional=[
                                                                    {'if': {'column_id': 'Index'},'width': '5%'},
                                                                    {'if': {'column_id': 'file'},'width': '10%' },
                                                                    {'if': {'column_id': 'topic_id'},'width': '5%' },
                                                                    {'if': {'column_id': 'paragraph'},'width': '75%','textAlign': 'left'},
                                                                    {'if': {'column_id': 'probability'},'width': '5%'},
                                                                    
    
                                                                ],
    
                                    style_as_list_view=True,
                 )
#        print('end')
        return table
          
# silence library warnings before starting the server
import warnings
warnings.filterwarnings('ignore')

app.run_server(mode = 'external', port=8053)
Dash app running on http://127.0.0.1:8053/
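
To make the three modes of the dropdown concrete, the sketch below reproduces the selection logic on a toy table, using plain Python rather than pandas so that it runs without the pickled data. The rows, the select function and its argument names are hypothetical and only mirror the dashboard's controls; it is an illustration of the filtering, not the program's implementation.

```python
from heapq import nlargest

# Toy topic distribution: one row per paragraph, one probability per topic (hypothetical data)
rows = [
    {"Index": 0, "paragraph": "para A", 1: 0.70, 2: 0.30},
    {"Index": 1, "paragraph": "para B", 1: 0.20, 2: 0.80},
    {"Index": 2, "paragraph": "para C", 1: 0.55, 2: 0.45},
]
topics = [1, 2]

def top_n_for_topic(rows, topic, n, threshold):
    """N highest-ranked paragraphs for one topic, dropping those below the threshold."""
    best = nlargest(n, rows, key=lambda r: r[topic])
    return [(r["Index"], topic, r[topic]) for r in best if r[topic] >= threshold]

def select(rows, topics, mode, n, threshold, topic=None):
    """Return (Index, topic_id, probability) tuples for the chosen mode."""
    if mode == "c1":    # N most prototypical paragraphs for topic K
        return top_n_for_topic(rows, topic, n, threshold)
    if mode == "c2":    # N most prototypical paragraphs overall
        salient = []
        for r in rows:
            t = max(topics, key=lambda t: r[t])   # most salient topic of this paragraph
            salient.append((r["Index"], t, r[t]))
        best = nlargest(n, salient, key=lambda s: s[2])
        return [s for s in best if s[2] >= threshold]
    if mode == "c3":    # N most prototypical paragraphs for each topic
        return [hit for t in topics for hit in top_n_for_topic(rows, t, n, threshold)]
    return []

print(select(rows, topics, "c1", n=2, threshold=0.5, topic=1))
```

In every mode the ranking is applied first and the threshold second, so a high threshold can return fewer than N paragraphs.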
In [ ]:
# remove the hash below and run the chunk to terminate the dashboard (the port must match the one passed to run_server)
#app._terminate_server_for_port('localhost', 8053)